<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: benzsevern</title>
    <description>The latest articles on DEV Community by benzsevern (@benzsevern).</description>
    <link>https://dev.to/benzsevern</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG</url>
      <title>DEV Community: benzsevern</title>
      <link>https://dev.to/benzsevern</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/benzsevern"/>
    <language>en</language>
    <item>
      <title>Reconciling 15 OSS Vulnerability Databases: What They Actually Cover</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Thu, 09 Apr 2026 22:22:05 +0000</pubDate>
      <link>https://dev.to/benzsevern/reconciling-15-oss-vulnerability-databases-what-they-actually-cover-19fl</link>
      <guid>https://dev.to/benzsevern/reconciling-15-oss-vulnerability-databases-what-they-actually-cover-19fl</guid>
      <description>&lt;p&gt;If you run an open source project, you probably rely on a vulnerability scanner that queries one or two databases. Dependabot looks at GitHub Security Advisories. &lt;code&gt;pip-audit&lt;/code&gt; looks at PyPA. &lt;code&gt;cargo audit&lt;/code&gt; looks at RustSec. Each tool has an opinion about what counts as a known vulnerability, and those opinions only partially overlap.&lt;/p&gt;

&lt;p&gt;I wanted to know, concretely, what the overlap looks like. Not "Dependabot is good" or "OSV is comprehensive" — actual numbers. So I did the same thing I did &lt;a href="https://dev.to/blog/2026-04-09-wallet-attribution-13m-records"&gt;last week for blockchain attribution data&lt;/a&gt;: pointed one entity-resolution pipeline at every public vulnerability database I could download for free and let the union-find speak.&lt;/p&gt;

&lt;p&gt;The answer is 869,771 records across 15 sources, collapsing to 608,463 canonical vulnerabilities. That reconciliation surfaces three findings I did not go looking for, and one of them changed how I think about OSS dependency scanning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fifteen sources
&lt;/h2&gt;

&lt;p&gt;Every one of these publishes bulk exports, under permissive licenses, without an API key:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;What it covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://osv.dev" rel="noopener noreferrer"&gt;OSV.dev&lt;/a&gt; (10 ecosystem bulks)&lt;/td&gt;
&lt;td&gt;519,760&lt;/td&gt;
&lt;td&gt;PyPI, npm, Go, Maven, RubyGems, crates.io, Packagist, NuGet, Debian, Alpine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/github/advisory-database" rel="noopener noreferrer"&gt;GitHub Advisory Database&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;350,164&lt;/td&gt;
&lt;td&gt;28,618 reviewed + 297,078 unreviewed mirrors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/pypa/advisory-database" rel="noopener noreferrer"&gt;PyPA advisory-database&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3,230&lt;/td&gt;
&lt;td&gt;Python Packaging Authority curated vulns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/golang/vulndb" rel="noopener noreferrer"&gt;Go vulnerability DB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3,079&lt;/td&gt;
&lt;td&gt;Go modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/rustsec/advisory-db" rel="noopener noreferrer"&gt;RustSec advisory-db&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1,022&lt;/td&gt;
&lt;td&gt;Rust crates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.first.org/epss/" rel="noopener noreferrer"&gt;EPSS&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;~326,000&lt;/td&gt;
&lt;td&gt;Exploit prediction scores per CVE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total records ingested&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;869,771&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things to notice about this list. First, &lt;strong&gt;OSV and GHSA dominate&lt;/strong&gt; — between them they account for essentially all of the 870k records. The smaller ecosystem-specific databases (PyPA, RustSec, Go vulndb) are curated subsets that contribute at most a few thousand entries each, often with higher-quality metadata. Second, &lt;strong&gt;GHSA splits internally&lt;/strong&gt; into "reviewed" (28k — the set GitHub's security team actually touches) and "unreviewed" (297k — a passthrough mirror of NVD filtered to packages GitHub tracks). That split is going to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The schema and the join
&lt;/h2&gt;

&lt;p&gt;I projected every source to a nine-column row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vuln_id    aliases   ecosystem   package   purl   published   modified   severity   source
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;vuln_id&lt;/code&gt; is the primary identifier that source uses — a GHSA-xxxx, CVE-xxxx, PYSEC-xxxx, RUSTSEC-xxxx, GO-xxxx, or MAL-xxxx. &lt;code&gt;aliases&lt;/code&gt; is a semicolon-joined list of cross-database identifiers the source knows about. &lt;code&gt;purl&lt;/code&gt; is the &lt;a href="https://github.com/package-url/purl-spec" rel="noopener noreferrer"&gt;Package URL&lt;/a&gt; — a canonical string like &lt;code&gt;pkg:pypi/tensorflow&lt;/code&gt; or &lt;code&gt;pkg:maven/io.grpc/grpc-protobuf&lt;/code&gt; that uniquely names a package across every public ecosystem.&lt;/p&gt;
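&lt;p&gt;For illustration, here is a rough PURL split that covers the shapes in this dataset. This is a sketch only, not the full spec; the &lt;code&gt;packageurl-python&lt;/code&gt; library handles qualifiers, subpaths, and percent-encoding properly.&lt;/p&gt;

```python
# Rough PURL split for the common shapes in this dataset (a sketch, not
# the full purl spec: no qualifiers, subpaths, or percent-decoding).
def parse_purl(purl: str) -> dict:
    body = purl.removeprefix("pkg:")
    parts = body.split("/")
    if len(parts) == 2:  # pkg:type/name
        return {"type": parts[0], "namespace": None, "name": parts[1]}
    # pkg:type/namespace/name, where the namespace may contain slashes
    return {"type": parts[0], "namespace": "/".join(parts[1:-1]), "name": parts[-1]}

print(parse_purl("pkg:pypi/tensorflow"))
print(parse_purl("pkg:maven/io.grpc/grpc-protobuf"))
```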

&lt;p&gt;The useful insight for the ER work is that &lt;strong&gt;OSV's &lt;code&gt;aliases&lt;/code&gt; field is a partial ground truth for the reconciliation pipeline&lt;/strong&gt;. An OSV entry for &lt;code&gt;GHSA-gcx2-gvj7-pxv3&lt;/code&gt; might say &lt;code&gt;aliases: [CVE-2022-24766, PYSEC-2022-170]&lt;/code&gt;. A separate entry in the PyPA database for &lt;code&gt;PYSEC-2022-170&lt;/code&gt; says &lt;code&gt;aliases: [GHSA-gcx2-gvj7-pxv3, CVE-2022-24766]&lt;/code&gt;. The alias graph is mostly pre-computed — the ER pipeline's job is to walk it transitively and catch the cases where it isn't.&lt;/p&gt;

&lt;p&gt;That's a union-find. I pointed one at the (vuln_id, aliases) pair for every row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ra&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;rb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rb&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ra&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iter_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;named&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;vid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vuln_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aliases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Forty lines of code, and it finishes in under a second on 616,237 distinct identifiers. After the compaction pass the pipeline has &lt;strong&gt;608,463 canonical vulnerability clusters&lt;/strong&gt;. Of those, &lt;strong&gt;345,568 (57%)&lt;/strong&gt; collapsed two or more distinct identifiers — meaning more than half of all canonical vulnerabilities in the free public data carry a cross-database alias.&lt;/p&gt;

&lt;p&gt;That's a much denser ER signal than the blockchain dataset from last week. The clusters are smaller on average (most have 2-3 IDs, not 10-45) but the ratio of "records that participate in multi-ID resolution" is dramatically higher. OSS security data is deliberately cross-linked; blockchain attribution data is not.&lt;/p&gt;
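&lt;p&gt;The compaction pass is nothing more than grouping every identifier by its union-find root. A toy sketch, assuming the &lt;code&gt;parent&lt;/code&gt; dict the loop above leaves behind:&lt;/p&gt;

```python
from collections import defaultdict

# Toy parent map as the union-find would leave it; the real run holds
# 616,237 identifiers. find() follows parent pointers up to the root.
parent = {
    "CVE-2022-24766": "GHSA-gcx2-gvj7-pxv3",
    "PYSEC-2022-170": "GHSA-gcx2-gvj7-pxv3",
}

def find(x: str) -> str:
    while parent.get(x, x) != x:
        x = parent[x]
    return x

ids = ["GHSA-gcx2-gvj7-pxv3", "CVE-2022-24766", "PYSEC-2022-170", "RUSTSEC-2021-0001"]
clusters = defaultdict(set)
for ident in ids:
    clusters[find(ident)].add(ident)

multi = sum(1 for members in clusters.values() if len(members) >= 2)
print(len(clusters), multi)  # 2 canonical clusters, 1 of them multi-ID
```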

&lt;h2&gt;
  
  
  Finding 1: GitHub reviews 9.1% of what it ingests
&lt;/h2&gt;

&lt;p&gt;Here is the headline number, and here is why I want to be careful about it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Set&lt;/th&gt;
&lt;th&gt;Canonical clusters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full OSS vulnerability universe (union of all sources)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;312,250&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;github-reviewed&lt;/code&gt; (GitHub security team curated)&lt;/td&gt;
&lt;td&gt;28,419 (&lt;strong&gt;9.1%&lt;/strong&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;github-unreviewed&lt;/code&gt; (NVD mirror filtered to tracked packages)&lt;/td&gt;
&lt;td&gt;297,076 (95.1%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OSV across all ecosystems (any)&lt;/td&gt;
&lt;td&gt;312,098 (99.95%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;9.1% is the percentage of the full free OSS vulnerability universe that ends up in GitHub's reviewed advisory set — the one the GitHub security team actually curates, enriches, and writes human-readable metadata for. The other 91% passes through GHSA as unreviewed CVE mirrors.&lt;/p&gt;

&lt;p&gt;I want to flag this next part explicitly, because it is the kind of number that is easy to misrepresent. &lt;strong&gt;This is not "Dependabot misses 91% of vulnerabilities."&lt;/strong&gt; Dependabot consumes both the reviewed and unreviewed GHSA sets, so in terms of &lt;em&gt;raw ID awareness&lt;/em&gt; its coverage is much closer to the full universe. What the 91% number actually measures is the &lt;strong&gt;curation ratio&lt;/strong&gt;: out of every hundred OSS vulnerability IDs that flow through GitHub's advisory pipeline, only about nine get the human review, the summary rewrite, the CWE assignment, the affected-versions normalization, the severity validation.&lt;/p&gt;

&lt;p&gt;So the accurate framing is: &lt;em&gt;most of what Dependabot shows you is passthrough data. Nine percent of it has been curated by a human on GitHub's security team.&lt;/em&gt; That's still interesting — most developers do not know their tool is 91% passthrough — but it is a statement about metadata quality, not a statement about coverage.&lt;/p&gt;

&lt;p&gt;For the record: &lt;code&gt;github-reviewed&lt;/code&gt; overlaps heavily with the per-ecosystem curated sets. PyPA, RustSec, and Go vulndb are mutually disjoint enrichment paths that contribute a few thousand high-quality entries each. If you point one tool at all of them, your &lt;em&gt;curated&lt;/em&gt; coverage roughly doubles. If you point one tool at the whole public universe, your &lt;em&gt;passthrough&lt;/em&gt; coverage goes to 99%. Most tools do neither.&lt;/p&gt;
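&lt;p&gt;Those percentages reduce to set algebra over canonical cluster roots. A toy sketch with stand-in sets, their sizes chosen to mirror the ratios in the table; in the real pipeline each set holds the union-find roots reachable from one source:&lt;/p&gt;

```python
# Stand-in root sets (hypothetical sizes mirroring the table's ratios);
# the real sets come out of the union-find compaction.
universe = {f"v{i:04d}" for i in range(1000)}
reviewed = {f"v{i:04d}" for i in range(91)}          # curated by GitHub
eco_curated = {f"v{i:04d}" for i in range(80, 170)}  # PyPA + RustSec + Go vulndb

def pct(subset: set) -> float:
    return round(100 * len(subset.intersection(universe)) / len(universe), 1)

print(pct(reviewed))                     # 9.1
print(pct(reviewed.union(eco_curated)))  # 17.0 -- curated coverage roughly doubles
```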

&lt;h2&gt;
  
  
  Finding 2: The JavaScript ecosystem has more tracked vulnerabilities than everything else combined
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ecosystem&lt;/th&gt;
&lt;th&gt;Canonical vulns&lt;/th&gt;
&lt;th&gt;Ratio to npm&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;npm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;217,162&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.00×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debian (4 active releases combined)&lt;/td&gt;
&lt;td&gt;~160,000&lt;/td&gt;
&lt;td&gt;0.74×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyPI&lt;/td&gt;
&lt;td&gt;15,920&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.07×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maven&lt;/td&gt;
&lt;td&gt;6,370&lt;/td&gt;
&lt;td&gt;0.03×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packagist (PHP)&lt;/td&gt;
&lt;td&gt;5,571&lt;/td&gt;
&lt;td&gt;0.03×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;3,627&lt;/td&gt;
&lt;td&gt;0.02×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alpine (10+ versions combined)&lt;/td&gt;
&lt;td&gt;~25,000&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RubyGems&lt;/td&gt;
&lt;td&gt;1,988&lt;/td&gt;
&lt;td&gt;0.009×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NuGet (.NET)&lt;/td&gt;
&lt;td&gt;1,653&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.008×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;crates.io&lt;/td&gt;
&lt;td&gt;1,396&lt;/td&gt;
&lt;td&gt;0.006×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;npm has 14× more tracked vulnerabilities than PyPI and 131× more than NuGet.&lt;/strong&gt; I want to be careful here too. There are at least three reasonable explanations for why these numbers look the way they do, and the data cannot distinguish between them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;npm has a much larger surface area.&lt;/strong&gt; The JavaScript ecosystem has more packages, more transitive dependencies per package, more maintainers, and more velocity. A bigger numerator is expected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm gets much more adversarial attention.&lt;/strong&gt; Typo-squatting campaigns, malicious packages, and coordinated supply chain attacks target npm disproportionately because it's where the blast radius is largest. More attention finds more bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other ecosystems get less scrutiny.&lt;/strong&gt; NuGet has 1,653 reported vulnerabilities across all of public .NET. That number is suspiciously small for an ecosystem that has run enterprise backends for two decades. Either .NET is miraculously clean or nobody is looking.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The honest read is that all three are partly true. The 131× gap between npm and NuGet is not a claim that npm is 131× less safe — it is a claim that the free public vulnerability-visibility stack is 131× more attentive to npm. If you are a .NET developer relying entirely on free tools, your observable attack surface is smaller than your actual one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 3: The free OSS stack is structurally blind to system-level vulnerabilities
&lt;/h2&gt;

&lt;p&gt;This is the finding I did not go looking for, and it is the one that will stick with me. I wrote a small section in the analyzer that looks up half a dozen famous vulnerabilities by CVE ID and dumps the cluster they resolve to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;famous&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Log4Shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2021-44228&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spring4Shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2022-22965&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Heartbleed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2014-0160&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shellshock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2014-6271&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ProxyShell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2021-34473&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ZipSlip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CVE-2018-1002105&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half of these resolve beautifully:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vuln&lt;/th&gt;
&lt;th&gt;Cluster sources&lt;/th&gt;
&lt;th&gt;Ecosystems&lt;/th&gt;
&lt;th&gt;Affected packages&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log4Shell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ghsa-reviewed + osv-Maven&lt;/td&gt;
&lt;td&gt;Maven&lt;/td&gt;
&lt;td&gt;5 log4j-derivative packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spring4Shell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ghsa-reviewed + osv-Maven&lt;/td&gt;
&lt;td&gt;Maven&lt;/td&gt;
&lt;td&gt;5 Spring packages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ZipSlip&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ghsa-reviewed + go-vulndb + osv-Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;&lt;code&gt;github.com/kubernetes/kubernetes&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Log4Shell's cluster correctly identifies &lt;code&gt;org.apache.logging.log4j:log4j-core&lt;/code&gt; plus four derivative wrappers (&lt;code&gt;com.guicedee.services:log4j-core&lt;/code&gt;, &lt;code&gt;org.ops4j.pax.logging:pax-logging-log4j2&lt;/code&gt;, etc.). If you were writing a Maven SBOM scanner, the ER pipeline has just done most of your work.&lt;/p&gt;

&lt;p&gt;The other three resolve to nothing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vuln&lt;/th&gt;
&lt;th&gt;Cluster sources&lt;/th&gt;
&lt;th&gt;Ecosystems&lt;/th&gt;
&lt;th&gt;Affected packages&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Heartbleed&lt;/strong&gt; (CVE-2014-0160)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ghsa-unreviewed only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Shellshock&lt;/strong&gt; (CVE-2014-6271)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ghsa-unreviewed only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;ProxyShell&lt;/strong&gt; (CVE-2021-34473)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ghsa-unreviewed only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;none&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Heartbleed is in the data. It has a CVE ID. It exists in the GHSA unreviewed mirror. But its cluster has &lt;strong&gt;no ecosystem tag and no affected package&lt;/strong&gt;. None of the curated sources — not PyPA, not RustSec, not Go vulndb, not any OSV ecosystem bucket — has Heartbleed attached to a single package. Same story for Shellshock. Same story for ProxyShell.&lt;/p&gt;

&lt;p&gt;Why? Because OpenSSL, bash, and Microsoft Exchange Server are not distributed through managed package ecosystems. OpenSSL ships as a C library bundled into operating system images, container base layers, Python wheels via &lt;code&gt;cryptography&lt;/code&gt;, Node.js builds, and about a thousand other places that do not go through npm or PyPI. Bash ships as a distro package. Exchange ships as an installer. None of them have a PURL. None of them have a declarable version range in a &lt;code&gt;requirements.txt&lt;/code&gt;. &lt;strong&gt;Package-level scanners cannot see them by construction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a structural property of how the free OSS vulnerability tooling stack is wired. The scanners that developers actually run — Dependabot, &lt;code&gt;pip-audit&lt;/code&gt;, &lt;code&gt;cargo audit&lt;/code&gt;, &lt;code&gt;npm audit&lt;/code&gt;, Snyk's free tier — all resolve vulnerabilities against package manifests. If the vulnerability is in a system library, the manifest does not reference it, and the scanner is silent.&lt;/p&gt;

&lt;p&gt;The next Heartbleed will not be detected by any of these tools. Not because the databases don't know about it — Heartbleed itself is in all of them — but because the thing doing the matching is asking the wrong question. It's asking "which of my declared packages is affected?" when it should be asking "which of the binaries actually installed on this machine is affected?" That is a completely different pipeline, and it lives in container image scanners like Trivy and Grype, with Syft generating the SBOMs that Grype consumes. Most developers do not run those tools.&lt;/p&gt;

&lt;p&gt;I did not expect ER to find this. I was looking for cross-database name disagreements and got handed a structural blind spot instead. The entity-resolution pipeline made it obvious because it projects every source to the same &lt;code&gt;(ecosystem, package)&lt;/code&gt; key — and when Heartbleed consistently projects to &lt;code&gt;(none, none)&lt;/code&gt;, the null result is loud.&lt;/p&gt;
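&lt;p&gt;The probe itself is small: resolve the CVE to its cluster root, collect that cluster's rows, and project them to &lt;code&gt;(ecosystem, package)&lt;/code&gt;. A sketch with a hypothetical root identifier and one toy row standing in for the nine-column frame:&lt;/p&gt;

```python
# Toy data: "GHSA-heartbleed-x" is a hypothetical root identifier, and
# one row stands in for the real nine-column frame.
parent = {"CVE-2014-0160": "GHSA-heartbleed-x"}
rows = [{"vuln_id": "GHSA-heartbleed-x", "ecosystem": None, "package": None}]

def find(x: str) -> str:
    while parent.get(x, x) != x:
        x = parent[x]
    return x

def project(cve: str) -> set:
    # Gather every (ecosystem, package) key attached to the CVE's cluster.
    root = find(cve)
    return {(r["ecosystem"], r["package"]) for r in rows if find(r["vuln_id"]) == root}

print(project("CVE-2014-0160"))  # {(None, None)} -- the loud null result
```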

&lt;h2&gt;
  
  
  What else is in the data
&lt;/h2&gt;

&lt;p&gt;A few secondary findings that do not need their own sections:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The highest-ID-count clusters are Bitnami container fanout.&lt;/strong&gt; The top of the disagreement list is dominated by entries like &lt;code&gt;GHSA-4xp2-w642-7mcx&lt;/code&gt;, which carries ten IDs, including &lt;code&gt;BIT-cilium-2023-41333&lt;/code&gt;, &lt;code&gt;BIT-cilium-operator-2023-41333&lt;/code&gt;, &lt;code&gt;BIT-cilium-proxy-2023-41333&lt;/code&gt;, &lt;code&gt;BIT-hubble-2023-41333&lt;/code&gt;, &lt;code&gt;BIT-hubble-relay-2023-41333&lt;/code&gt;, &lt;code&gt;BIT-hubble-ui-2023-41333&lt;/code&gt;, plus the root GHSA and CVE. Bitnami's scanner emits one BIT-prefixed identifier per container variant of the same underlying vulnerability. The union-find correctly collapses these, which is a legitimate ER outcome, but it is not the dramatic cross-database name disagreement I was hoping for. The real story is boring: OSV has a known vuln, six Bitnami container images inherit it, and the ID-per-container convention inflates the count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-ecosystem misfiling exists in the raw data.&lt;/strong&gt; While sampling OSV's PyPI ecosystem dump I found &lt;code&gt;GHSA-cfgp-2977-2fmm&lt;/code&gt; — filed in the PyPI directory, but its only affected package is &lt;code&gt;pkg:maven/io.grpc/grpc-protobuf&lt;/code&gt;, a Java gRPC library. If you filter OSV by directory name instead of by PURL, you silently lose vulnerabilities to misfiling. The ER pipeline catches this automatically because it joins on PURL, not on directory.&lt;/p&gt;
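&lt;p&gt;The guard is one line: derive the ecosystem from the record's own PURL instead of trusting the directory it was filed under. A sketch using the misfiled advisory above as a toy record:&lt;/p&gt;

```python
# Derive the ecosystem from the PURL type, not the directory name, so a
# misfiled advisory still lands in the right ecosystem bucket.
def ecosystem_from_purl(purl: str) -> str:
    return purl.removeprefix("pkg:").split("/")[0]

record = {
    "vuln_id": "GHSA-cfgp-2977-2fmm",
    "dir": "pypi",                              # where the file was filed
    "purl": "pkg:maven/io.grpc/grpc-protobuf",  # what it actually affects
}

print(ecosystem_from_purl(record["purl"]))  # maven -- disagrees with the directory
```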

&lt;p&gt;&lt;strong&gt;EPSS does not change the coverage story.&lt;/strong&gt; Every CVE has an EPSS exploit-prediction score (326k of them), and I pulled the dataset hoping to find that high-EPSS vulns are better covered across databases than low-EPSS ones. They are not, meaningfully. Coverage is a function of which ecosystem the package lives in, not how exploitable the vuln is. That is its own kind of finding but does not carry a post on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;I want to be precise about what this analysis is and isn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No NVD direct ingestion.&lt;/strong&gt; I pulled NVD via its propagation into GHSA-unreviewed and OSV rather than hitting the REST API directly. That covers most OSS-ecosystem packages but does miss NVD entries that never made it into either mirror. Adding NVD as a 16th source would answer the "pure NVD coverage gap" question directly, at the cost of ~15 minutes of paginated fetching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Union-find on literal IDs.&lt;/strong&gt; Identifiers are not case-normalized before matching. In practice OSV, GHSA, and the curated sources are consistent about identifier format, but this is worth stating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row counts are not vuln counts.&lt;/strong&gt; One advisory that affects three packages emits three rows. The canonical-cluster numbers in this post are distinct counts after ER, not raw rows. Both are in &lt;code&gt;output/report.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No version-range normalization.&lt;/strong&gt; The ER pipeline joins on the &lt;code&gt;(vuln_id, alias)&lt;/code&gt; graph, not on affected versions. This is sufficient for "which databases know about this vulnerability," but not for "is the specific version I have installed affected." Those are different questions and need different pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No commercial database comparison.&lt;/strong&gt; Snyk, Sonatype, Chainguard, Anchore, and JFrog all maintain databases that are richer than anything in this post. None of them are bulk-downloadable without a paid plan. The story here is specifically about the free tier, which is what most individual developers actually use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Blind spot" is strong language.&lt;/strong&gt; The free OSS tooling stack is blind to Heartbleed-class vulnerabilities &lt;em&gt;when invoked as a package-level scanner&lt;/em&gt;. Container scanners like Trivy, Grype, and Syft do look at system libraries. The blind spot is at the specific layer most developers interact with — &lt;code&gt;dependabot&lt;/code&gt; or &lt;code&gt;pip-audit&lt;/code&gt; on a repo — not at the whole ecosystem.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15 free public databases, 869,771 records, 608,463 canonical vulnerabilities&lt;/strong&gt; after union-find on the cross-database alias graph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Security Advisories reviews about 9.1% of what it ingests.&lt;/strong&gt; Most of what Dependabot surfaces is passthrough NVD data with no curation, no CWE assignment, and no human review. Developers do not usually know this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The JavaScript ecosystem has 14× more tracked vulnerabilities than Python and 131× more than .NET.&lt;/strong&gt; The data cannot tell you whether that is attention, scrutiny, or real exposure — but the asymmetry itself is measured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package-level vulnerability scanners cannot see Heartbleed, Shellshock, or ProxyShell.&lt;/strong&gt; Not because the databases don't know — they do — but because these vulnerabilities live in system software with no PURL and no declarable dependency. The free OSS stack is structurally blind to this class by construction. If you care about system-library vulns, run a container scanner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity resolution is the right tool for this question.&lt;/strong&gt; Union-find on the alias graph collapses 57% of canonical vulnerabilities across cross-database identifiers, producing a unified view that no single tool gives you. The blockchain post from last week established the same pattern for a completely different domain; the pipeline is domain-agnostic.&lt;/li&gt;
&lt;/ul&gt;
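&lt;p&gt;The union-find step itself is small enough to show whole. A minimal sketch of collapsing the alias graph into canonical clusters; the edges below are hypothetical stand-ins for the real cross-database pairs (only &lt;code&gt;CVE-2021-44228&lt;/code&gt;, Log4Shell, is a real identifier here):&lt;/p&gt;

```python
def find(parent, x):
    # iterative find with path halving
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_aliases(pairs):
    """Collapse (id, alias) edges into canonical clusters via union-find."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union by arbitrary root; fine at this scale
    clusters = {}
    for node in parent:
        clusters.setdefault(find(parent, node), set()).add(node)
    return list(clusters.values())


# hypothetical alias edges; two databases agree on the Log4Shell CVE
edges = [
    ("GHSA-aaaa-bbbb-cccc", "CVE-2021-44228"),
    ("CVE-2021-44228", "OSV-2021-9999"),
    ("RUSTSEC-2020-0000", "CVE-2020-00000"),
]
```

The clusters this emits are what the canonical counts above are taken over: three identifiers that transitively alias each other become one vulnerability.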

&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;

&lt;p&gt;Everything in this post is in a public repo: &lt;strong&gt;&lt;a href="https://github.com/benzsevern/goldenmatch-vuln-attribution" rel="noopener noreferrer"&gt;benzsevern/goldenmatch-vuln-attribution&lt;/a&gt;&lt;/strong&gt;. Four commands from a fresh clone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python fetch_public_data.py     &lt;span class="c"&gt;# ~600 MB download, ~5 min&lt;/span&gt;
python count_sources.py         &lt;span class="c"&gt;# diagnostic row count, optional&lt;/span&gt;
python extract_records.py       &lt;span class="c"&gt;# sources → single parquet (~30 sec)&lt;/span&gt;
python analyze.py               &lt;span class="c"&gt;# union-find ER + findings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All six upstream feeds (which fan out into the 15 logical sources) are permissively licensed and redistributable. No API keys. No auth. The full 869k-row analysis finishes in under a minute once the data is local. Outputs land in &lt;code&gt;output/&lt;/code&gt; — &lt;code&gt;report.json&lt;/code&gt; for the headline numbers, &lt;code&gt;famous_vulns.json&lt;/code&gt; for the Log4Shell/Heartbleed/Shellshock clusters, &lt;code&gt;top_disagreement.json&lt;/code&gt; for the Bitnami fanout examples.&lt;/p&gt;

&lt;p&gt;If you want to see the same ER pattern applied to a completely different domain, the companion repo is &lt;strong&gt;&lt;a href="https://github.com/benzsevern/goldenmatch-wallet-attribution" rel="noopener noreferrer"&gt;benzsevern/goldenmatch-wallet-attribution&lt;/a&gt;&lt;/strong&gt; — 13.1 million blockchain attribution records reconciled the same way. Both posts use the same library (&lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;GoldenMatch&lt;/a&gt;) and the same conceptual pipeline; only the data changes.&lt;/p&gt;

&lt;p&gt;Install GoldenMatch: &lt;code&gt;pip install goldenmatch&lt;/code&gt;. Star the repo: &lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;benzsevern/goldenmatch&lt;/a&gt;. Try the playground: &lt;a href="https://bensevern.dev/playground" rel="noopener noreferrer"&gt;bensevern.dev/playground&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Reproducibility footer.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source datasets:&lt;/strong&gt; OSV.dev bulk exports (&lt;code&gt;osv-vulnerabilities.storage.googleapis.com&lt;/code&gt;, 10 ecosystems), &lt;code&gt;github/advisory-database&lt;/code&gt; main branch, &lt;code&gt;pypa/advisory-database&lt;/code&gt; main, &lt;code&gt;rustsec/advisory-db&lt;/code&gt; main, &lt;code&gt;golang/vulndb&lt;/code&gt; master, EPSS current scores (&lt;code&gt;epss.empiricalsecurity.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total download:&lt;/strong&gt; ~600 MB of zip archives, read in place via &lt;code&gt;zipfile.ZipFile&lt;/code&gt; (no extraction — NTFS cluster overhead inflates the on-disk footprint of millions of tiny JSON files by roughly two orders of magnitude).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input rows:&lt;/strong&gt; 869,771 across 15 sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique vuln_ids:&lt;/strong&gt; 616,237.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Canonical vulnerabilities post-ER:&lt;/strong&gt; 608,463. Clusters with 2+ IDs: 345,568. Full OSS universe: 312,250.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;github-reviewed share of full universe:&lt;/strong&gt; 9.1% (28,419 / 312,250).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;goldenmatch&lt;/code&gt; 1.4.4 (conceptual reference, pipeline is union-find + polars for the scale-up), &lt;code&gt;polars&lt;/code&gt; 1.39, &lt;code&gt;pyyaml&lt;/code&gt; 6.0, Python 3.12.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; Windows laptop, 32 GB RAM. Full pipeline completes in under 90 seconds once data is local.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code and raw outputs:&lt;/strong&gt; &lt;a href="https://github.com/benzsevern/goldenmatch-vuln-attribution" rel="noopener noreferrer"&gt;benzsevern/goldenmatch-vuln-attribution&lt;/a&gt; (MIT). Scripts: &lt;code&gt;fetch_public_data.py&lt;/code&gt;, &lt;code&gt;count_sources.py&lt;/code&gt;, &lt;code&gt;extract_records.py&lt;/code&gt;, &lt;code&gt;analyze.py&lt;/code&gt;. Headline JSON: &lt;code&gt;output/report.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data date:&lt;/strong&gt; 2026-04-10.&lt;/li&gt;
&lt;/ul&gt;
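&lt;p&gt;The no-extraction point deserves a concrete sketch. Reading JSON advisories straight out of a downloaded archive keeps millions of tiny files off the filesystem entirely; the helper below is illustrative, not the repo's actual code:&lt;/p&gt;

```python
import io
import json
import zipfile


def iter_zip_json(zip_path):
    """Yield parsed JSON records directly from an archive, never extracting to disk."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".json"):
                with zf.open(name) as fh:
                    yield json.load(fh)
```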




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-10-oss-vulnerability-reconciliation" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Wallet Attribution at Scale: ER on 13M Blockchain Records</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Thu, 09 Apr 2026 18:32:45 +0000</pubDate>
      <link>https://dev.to/benzsevern/wallet-attribution-at-scale-er-on-13m-blockchain-records-5g5m</link>
      <guid>https://dev.to/benzsevern/wallet-attribution-at-scale-er-on-13m-blockchain-records-5g5m</guid>
      <description>&lt;p&gt;Every public blockchain attribution dataset is a partial, opinionated view of the same underlying reality. OFAC publishes ~800 sanctioned crypto wallets. Etherscan crowdsources ~50,000 tags across seven EVM chains. Sourcify holds ~14 million verified contract deployments. Forta tracks known-malicious contracts. DeFiLlama catalogs protocol addresses. Israel's Ministry of Defense and the FBI's Lazarus unit each publish their own targeted lists. None of them agree, none of them are complete, and almost none of them talk to each other.&lt;/p&gt;

&lt;p&gt;I wanted to know what happens if you reconcile all of them. Not with a custom schema, not with hand-written joins — with a single entity resolution pipeline pointed at every public source I could find. The answer is 13,147,920 input rows, 30,958 multi-source clusters, and three findings I could not have produced at smaller scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ten sources
&lt;/h2&gt;

&lt;p&gt;I pulled every freely redistributable blockchain attribution dataset I could verify:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Rows&lt;/th&gt;
&lt;th&gt;What it covers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://sanctionslistservice.ofac.treas.gov/" rel="noopener noreferrer"&gt;OFAC SDN Enhanced XML&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;788&lt;/td&gt;
&lt;td&gt;US Treasury sanctioned wallets, 18 chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/brianleect/etherscan-labels" rel="noopener noreferrer"&gt;brianleect/etherscan-labels&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;52,773&lt;/td&gt;
&lt;td&gt;Crowdsourced Etherscan tags, 7 EVM chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/dawsbot/eth-labels" rel="noopener noreferrer"&gt;dawsbot/eth-labels&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;17,495&lt;/td&gt;
&lt;td&gt;Curated Ethereum mainnet categories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://export.sourcify.dev/manifest.json" rel="noopener noreferrer"&gt;Sourcify parquet exports&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;13,062,088&lt;/td&gt;
&lt;td&gt;Verified contract deployments, all chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/forta-network/labelled-datasets" rel="noopener noreferrer"&gt;Forta labelled-datasets&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;7,480&lt;/td&gt;
&lt;td&gt;Known malicious contracts + phishing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="https://api.llama.fi/protocols" rel="noopener noreferrer"&gt;DeFiLlama protocols&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3,332&lt;/td&gt;
&lt;td&gt;Protocol contract addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/scamsniffer/scam-database" rel="noopener noreferrer"&gt;ScamSniffer blacklist&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2,530&lt;/td&gt;
&lt;td&gt;Reported scam addresses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/MyEtherWallet/ethereum-lists" rel="noopener noreferrer"&gt;ethereum-lists&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;717&lt;/td&gt;
&lt;td&gt;Dark/light address lists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.opensanctions.org/datasets/il_mod_crypto/" rel="noopener noreferrer"&gt;OpenSanctions: il_mod_crypto&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;684&lt;/td&gt;
&lt;td&gt;Israel MoD sanctioned wallets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.opensanctions.org/datasets/us_fbi_lazarus_crypto/" rel="noopener noreferrer"&gt;OpenSanctions: us_fbi_lazarus_crypto&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;FBI Lazarus Group wallets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13,147,920&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sourcify dominates by two orders of magnitude. Everything else is a long tail of curated, opinionated, high-signal labels. That asymmetry shapes the whole story: Sourcify tells you &lt;em&gt;what addresses exist&lt;/em&gt;, the other nine tell you &lt;em&gt;what they mean&lt;/em&gt;, and entity resolution is what turns one into the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  The schema
&lt;/h2&gt;

&lt;p&gt;Every source projects to five columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;COMMON_COLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;address_norm&lt;/code&gt;&lt;/strong&gt; is the joining key: lowercase, &lt;code&gt;0x&lt;/code&gt; prefix stripped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;address_raw&lt;/code&gt;&lt;/strong&gt; keeps the original format for display.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;entity_name&lt;/code&gt;&lt;/strong&gt; is whatever the source calls it ("LAZARUS GROUP", "Safe: Proxy Factory 1.3.0", "").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;label&lt;/code&gt;&lt;/strong&gt; is the source-specific tag, namespaced (&lt;code&gt;etherscan:etherscan:ofac-sanctions-lists&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;source&lt;/code&gt;&lt;/strong&gt; is the dataset identifier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No fuzzy matching on names. Names disagree too often to be a primary signal — that's actually the central finding. The only reliable join is &lt;code&gt;address_norm&lt;/code&gt;.&lt;/p&gt;
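&lt;p&gt;A toy version of the join makes the point. The normalization below matches the &lt;code&gt;address_norm&lt;/code&gt; rule described above (lowercase, &lt;code&gt;0x&lt;/code&gt; stripped); the two source dicts are illustrative stand-ins for the staged CSVs:&lt;/p&gt;

```python
def normalize(addr: str) -> str:
    """Lowercase and strip the 0x prefix: the only reliable join key."""
    addr = addr.strip().lower()
    return addr[2:] if addr.startswith("0x") else addr


# toy stand-ins for two staged sources that disagree on names
ofac = {"0xAbC123DeF": "LAZARUS GROUP"}
etherscan = {"0xabc123def": "Ronin Bridge Exploiter"}

merged = {}
for src in (ofac, etherscan):
    for raw, name in src.items():
        merged.setdefault(normalize(raw), []).append(name)
```

A join on the name columns would find nothing here; the normalized address collapses both rows into one cluster.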

&lt;h2&gt;
  
  
  Running GoldenMatch at 535k
&lt;/h2&gt;

&lt;p&gt;I started at a sensible scale: five sources plus Sourcify's Ethereum mainnet subset, 535,336 rows, and a direct call to &lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;&lt;code&gt;goldenmatch.dedupe&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01_ofac.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;02_etherscan_labels.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;03_eth_labels.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;04_ethereum_lists.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;05_defillama.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;STAGED&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;06_sourcify.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This finished in about 40 seconds on a Windows laptop, found 12,640 multi-member clusters, and auto-fixed 51 data quality issues in the raw public sources (smart quotes, invisible characters, stray whitespace) before matching. GoldenCheck's quality scanner is bundled into the dedupe call — you don't ask for it, it just happens.&lt;/p&gt;

&lt;p&gt;The results at 535k surfaced the best single anecdote in the whole exercise:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;OFAC name&lt;/th&gt;
&lt;th&gt;Etherscan name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0x098B716B8Aaf21512996dC57EB0615e2383E2f96&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LAZARUS GROUP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ronin Bridge Exploiter&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0x5f48c2a71b2cc96e3f0ccae4e39318ff0dc375b2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;SEMENOV, Roman&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Tornado.Cash: Team 1 Vesting&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first row is the Axie Infinity Ronin Bridge wallet — the address behind the $625 million Lazarus Group hack, labeled by OFAC as "LAZARUS GROUP" and by Etherscan as "Ronin Bridge Exploiter." Two correct names, completely unrelated strings. A name-based join finds nothing. An address-normalized join finds the link instantly. The second row ties a sanctioned Tornado Cash co-founder to a specific named vesting contract. If you take only one thing from this post, take this: &lt;strong&gt;names disagree, addresses don't&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling to 13 million
&lt;/h2&gt;

&lt;p&gt;The 535k run validated the pipeline. I wanted to know what happened at the real ceiling of free public data. That meant pulling all 14 Sourcify deployment parquets (one per million contracts, ~2 GB total) covering every chain Sourcify tracks — not just Ethereum mainnet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# fetch_all_sourcify.py — parallel download of 14 parquets
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fut&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetch_parquet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;fut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then the staging step streams them directly into the common schema via Polars without ever materializing the full frame:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pq&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parquets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sourcify:chain_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chain_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utf8&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sourcify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;include_header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After staging, the full dataset is 13,147,920 rows across 10 sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  The honest caveat
&lt;/h3&gt;

&lt;p&gt;At 13M rows, calling &lt;code&gt;goldenmatch.dedupe&lt;/code&gt; crashes at the cluster-materialization step with a &lt;code&gt;MemoryError&lt;/code&gt; in the Python dict build-out. That's not a GoldenMatch bug — it's pure Python object overhead on 12 million unique cluster keys. Since the full pipeline was already reducing to exact-match-on-&lt;code&gt;address_norm&lt;/code&gt; blocking (names disagree too much to fuzzy on), the operation is mathematically equivalent to a &lt;code&gt;polars&lt;/code&gt; groupby. I wrote that directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;all_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;STAGED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;all_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;n_unique&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entity_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same logical result, fits in memory, runs in about 30 seconds. The 535k run proves the ER pipeline works end-to-end with GoldenMatch's full feature set (fuzzy scorers, blocking strategies, lineage, golden records). The 13M run uses GoldenMatch's auto-config decisions as the template but delegates the exact-match groupby to Polars because Python dicts are the wrong data structure at that volume. I want to be upfront about that — the scale-up is not an endorsement of "GoldenMatch scales to 13M natively," it's an endorsement of "GoldenMatch chose the right blocking strategy at 535k, and that strategy is trivially reproducible at 13M in a columnar engine."&lt;/p&gt;

&lt;h2&gt;
  
  
  What the 13M run surfaced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Nine wallets cross-sanctioned by two governments
&lt;/h3&gt;

&lt;p&gt;This is the headline finding and only possible because I had two independent sanctions sources. Nine crypto wallets appear on both the US Treasury OFAC list and Israel's MoD sanctioned crypto list:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Wallet&lt;/th&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Chain&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TCzq6m2zxnQkrZrf8cqYcK6bbXQYAfWYKC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ZEDCEX EXCHANGE LTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TGsNFrgWfbGN2gX25Wcf8oTejtxtQkvmEx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ZEDCEX EXCHANGE LTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TTS9o5KkpGgH8cK9LofLmMAPYb5zfQvSNa&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ZEDCEX EXCHANGE LTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TNuA5CQ6LB4jTHoNrjEeQZJmcmhQuHMbQ7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ZEDCEX EXCHANGE LTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TLvuvpfBKdxddxSsJefeiGCe9eVY8HUroE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;ZEDCEX EXCHANGE LTD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TWBAPzpPiZarfVsY2BLXeaLhNHurn4wkWG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AL-LAW, Tawfiq Muhammad Sa'id&lt;/td&gt;
&lt;td&gt;Tron&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0x175d44451403Edf28469dF03A9280c1197ADb92c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GAZA NOW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ethereum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0x21B8d56BDA776bbE68655A16895afd96F5534feD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GAZA NOW&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ethereum&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;19D1iGzDr7FyAdiy3ZZdxMd6ttHj1kj6WW&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BUY CASH MONEY AND MONEY TRANSFER CO&lt;/td&gt;
&lt;td&gt;Bitcoin&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ZEDCEX cluster is the standout: five wallets on a single Tron-based exchange independently sanctioned by both the United States and Israel. GAZA NOW contributes two cross-confirmed Ethereum wallets. These are the highest-confidence sanctioned wallets in the entire dataset — not because any individual list is more authoritative, but because two independent government entity resolution processes landed on the same on-chain identities.&lt;/p&gt;

&lt;p&gt;You cannot find this with OFAC alone. You cannot find it with Israel's list alone. You find it only when you reconcile them.&lt;/p&gt;
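&lt;p&gt;Mechanically, the cross-sanctions check is just normalize-then-intersect. A minimal sketch with placeholder wallets (not the real sanctioned addresses):&lt;/p&gt;

```python
def normalize(addr):
    # EVM addresses are case-insensitive hex, so lowercase them; Tron and
    # Bitcoin base58 addresses are case-sensitive and must be left alone.
    return addr.lower() if addr.startswith("0x") else addr

# Placeholder lists; the real inputs are the OFAC SDN and il_mod_crypto feeds.
ofac = {"0xAAAA1111", "TWalletOne", "1BtcWalletX"}
il_mod = {"0xaaaa1111", "TWalletOne", "TWalletOther"}

cross_sanctioned = {normalize(a) for a in ofac}.intersection(
    normalize(a) for a in il_mod
)
print(sorted(cross_sanctioned))
```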

&lt;h3&gt;
  
  
  2. The largest clusters are universal infrastructure
&lt;/h3&gt;

&lt;p&gt;At multi-chain scale, the top multi-source clusters by member count are all deterministic-deployment contracts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Address&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x7cbb62eaa69f79e6873cd1ecb2392971036cfaa4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Create Call 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x40a2accbd92bca938b02010e17a5b8929b49130d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Multi Send Call Only 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0xa6b71e26c5e0845f74c812102ca7114b6a896ab2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Proxy Factory 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x3e5c63644e683549055b9be8653de26e0b4cd36e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Singleton L2 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0xd9db270c1b5e3bd161e8c8503c55ceabee709552&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Singleton 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0xf48f2b2d2a534e402487b3ee7c18c33aec0fe5e4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Safe: Compatibility Fallback Handler 1.3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x000000000022d473030f116ddee9f6b43ac78ba3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Uniswap Permit2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0x66a71dcef29a0ffbdbe3c6a460a3b5bc225cd675&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LayerZero Ethereum Endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Safe (formerly Gnosis Safe) deploys its contracts via CREATE2 with chain-independent salts, which means each contract ends up at the same address on every EVM chain it's deployed to. So do Permit2 and LayerZero. The cluster size is literally a count of "how many chains is this deployed on?": 45 chains for &lt;code&gt;Create Call 1.3.0&lt;/code&gt;, 30 for Permit2.&lt;/p&gt;

&lt;p&gt;That's a real finding about the structure of the modern EVM ecosystem. Entity resolution on multi-chain contract deployment data &lt;strong&gt;automatically surfaces the universal-infrastructure layer&lt;/strong&gt; without anyone asking it to. If you're building an allowlist of "standard reusable contracts that are safe-by-reputation across every chain," this cluster table is a reasonable starting point. I did not go in looking for this — it just fell out of the data.&lt;/p&gt;
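&lt;p&gt;The "cluster size equals chain count" observation falls out of a one-pass groupby. A sketch, assuming per-chain deployment rows with truncated, illustrative addresses:&lt;/p&gt;

```python
from collections import defaultdict

# Per-chain deployment rows: (chain, contract address). Deterministic CREATE2
# deployments reuse one address everywhere, so grouping by address makes the
# cluster size equal the chain count.
deployments = [
    ("ethereum", "0x7cbb62ea"), ("optimism", "0x7cbb62ea"),
    ("arbitrum", "0x7cbb62ea"), ("polygon", "0x7cbb62ea"),
    ("ethereum", "0x000022d4"), ("base", "0x000022d4"),
]

chains_by_address = defaultdict(set)
for chain, addr in deployments:
    chains_by_address[addr].add(chain)

# Largest-first: the head of this list is the universal-infrastructure layer.
by_size = sorted(chains_by_address.items(), key=lambda kv: -len(kv[1]))
for addr, chains in by_size:
    print(addr, len(chains))
```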

&lt;h3&gt;
  
  
  3. Attackers verify source code at the long-tail baseline rate
&lt;/h3&gt;

&lt;p&gt;I also pulled Forta's &lt;code&gt;labelled-datasets&lt;/code&gt; repo, which includes 719 known-malicious Ethereum smart contracts and 569 phishing-scam contracts. The honest question: &lt;strong&gt;do attackers publish verified source code on Sourcify?&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Population&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Verified on Sourcify&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Forta malicious contracts&lt;/td&gt;
&lt;td&gt;719&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 (0.4%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Forta phishing contracts&lt;/td&gt;
&lt;td&gt;569&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 (0.5%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ScamSniffer addresses&lt;/td&gt;
&lt;td&gt;2,530&lt;/td&gt;
&lt;td&gt;0 (0%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is where I have to resist the obvious headline. "Malicious contracts almost never verify source code" is technically true but misleading: Sourcify holds ~324k verified Ethereum mainnet contracts against an estimated ~70M+ total contracts ever deployed, which puts the &lt;em&gt;baseline&lt;/em&gt; verification rate around 0.5%. Malicious contracts at 0.4% are statistically indistinguishable from that baseline.&lt;/p&gt;
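&lt;p&gt;The "statistically indistinguishable" claim is cheap to sanity-check with an exact binomial tail, using only the numbers above:&lt;/p&gt;

```python
from math import comb

def binom_tail_leq(n, p, k):
    # Exact P(X is at most k) for X distributed Binomial(n, p).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, observed = 719, 3             # Forta malicious contracts; Sourcify-verified
baseline = 324_000 / 70_000_000  # ~0.46%: verified mainnet / all deployed

expected = n * baseline
p_low = binom_tail_leq(n, baseline, observed)
print(f"expected {expected:.1f} verified under baseline, observed {observed}")
print(f"P(at most {observed} verified) = {p_low:.2f}")
```

&lt;p&gt;The tail probability comes out far above any significance threshold, which is the precise version of "indistinguishable from baseline."&lt;/p&gt;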

&lt;p&gt;The defensible framing is this: &lt;strong&gt;malicious contracts behave like the long tail of random/abandoned/spam contracts, not like production contracts.&lt;/strong&gt; Mainstream DeFi protocols verify at rates well above 50%. Attackers don't. For the purposes of attribution, "is this a Sourcify-verified contract?" on its own is a weak filter — but "verified Sourcify contract AND has Etherscan tags AND appears in eth-labels AND DeFiLlama" is an extremely strong &lt;em&gt;legitimacy&lt;/em&gt; signal. The 301 quadruple-confirmed clusters at 13M scale are the set of contracts that every independent attribution observer agrees exist and matter.&lt;/p&gt;

&lt;p&gt;The 3 verified malicious contracts are outliers worth manual investigation: two are &lt;code&gt;Fake_Phishing&lt;/code&gt; tagged contracts that nonetheless published source (presumably to look legitimate to a casual reviewer), and one is a suspicious "TrueEUR" token.&lt;/p&gt;

&lt;h2&gt;
  
  
  What else is in the data
&lt;/h2&gt;

&lt;p&gt;A few secondary findings that didn't need their own section:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deployer-reuse patterns in the malicious-contracts dataset.&lt;/strong&gt; The Forta dataset records &lt;code&gt;contract_creator&lt;/code&gt; for each malicious contract. Grouping by creator surfaces a heavily skewed distribution: one address deployed &lt;strong&gt;15&lt;/strong&gt; &lt;code&gt;Fake_Phishing&lt;/code&gt; contracts, another deployed &lt;strong&gt;11&lt;/strong&gt;, and the original &lt;strong&gt;bZx Exploiter 1&lt;/strong&gt; wallet deployed &lt;strong&gt;9&lt;/strong&gt; distinct exploit contracts — all tied back to the 2020 bZx flash-loan attack. Twelve deployer addresses are responsible for roughly 15% of the entire labeled malicious-contract corpus. Watching deployers is dramatically more efficient than watching deployments, and the clustering falls out of ER trivially.&lt;/p&gt;
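&lt;p&gt;The deployer-reuse grouping is a one-liner once you have the &lt;code&gt;contract_creator&lt;/code&gt; column. A sketch with invented rows in the same shape:&lt;/p&gt;

```python
from collections import Counter

# (contract address, contract_creator) pairs in the shape of the Forta
# labelled dataset's columns. Values are invented for illustration.
malicious = [
    ("0xc1", "0xdeployerA"), ("0xc2", "0xdeployerA"), ("0xc3", "0xdeployerA"),
    ("0xc4", "0xdeployerB"), ("0xc5", "0xdeployerB"),
    ("0xc6", "0xdeployerC"),
]

deploy_counts = Counter(creator for _, creator in malicious)

# Heavy hitters: a handful of deployers covers a disproportionate share of
# the corpus, which is why watching deployers beats watching deployments.
for creator, n in deploy_counts.most_common(2):
    print(creator, n)
```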

&lt;p&gt;&lt;strong&gt;OFAC's internal duplicates.&lt;/strong&gt; The SUEX OTC wallets appear in OFAC's own list twice — once under the &lt;code&gt;XBT:CYBER2&lt;/code&gt; program and once under &lt;code&gt;USDT:CYBER2&lt;/code&gt;, because Treasury sanctioned the same Bitcoin address for Bitcoin activity and the Tron-USDT it bridged through. Without ER you'd treat them as two distinct records; with ER the internal duplication is obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-source pattern distribution.&lt;/strong&gt; At 13M the dominant multi-source pattern is &lt;code&gt;etherscan-labels + sourcify&lt;/code&gt; (12,472 clusters) — verified contracts that are also tagged. Then &lt;code&gt;eth-labels + forta&lt;/code&gt; (5,146) — curated DeFi labels overlapping with malicious flags. Then the triple-confirmed &lt;code&gt;eth-labels + etherscan-labels + sourcify&lt;/code&gt; (5,034). The full distribution is in &lt;a href="https://github.com/benzsevern/goldenmatch-wallet-attribution/blob/main/output_15m/report.json" rel="noopener noreferrer"&gt;&lt;code&gt;output_15m/report.json&lt;/code&gt;&lt;/a&gt; in the companion repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;p&gt;I want to be precise about what this analysis is and isn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It is not criminal wallet discovery.&lt;/strong&gt; Every "sanctioned" label comes from a government source. ER reconciles those labels across sources. It does not identify new bad actors. Nothing in this post claims to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is not a substitute for on-chain forensics.&lt;/strong&gt; Chainalysis-style graph tracing answers a completely different question (follow the flows). This pipeline answers "whose opinions do we have about this address, and do they agree?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ground truth is bounded.&lt;/strong&gt; When two sanctions lists agree, you have two-jurisdiction confirmation, which is strong. When Forta and eth-labels agree on a malicious tag, you have two independent community labels, which is weaker. Nothing here is a court case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Sourcify baseline assumes the universe of all Ethereum contracts.&lt;/strong&gt; If you normalize against "contracts anyone cares about" instead of "all contracts ever deployed," the verification-rate story changes. I chose the inclusive denominator on purpose — it's what Sourcify's own data supports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Names are dirty.&lt;/strong&gt; I found five different OFAC entries for the same Bitcoin address with slightly different entity spellings; two different Etherscan tags for the same Lazarus wallet; and an Israeli-sourced wallet whose "entity_name" field was just the address itself. ER is only as clean as the input. &lt;a href="https://github.com/benzsevern/goldencheck" rel="noopener noreferrer"&gt;GoldenCheck&lt;/a&gt; auto-fixed 51 text-level issues before matching, but it didn't — and shouldn't — normalize semantic disagreement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ten public blockchain attribution datasets, 13.1M records, 30,958 multi-source clusters.&lt;/strong&gt; The free public attribution universe is larger than it looks if you combine it, and trivially reconcilable if you normalize the address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Names disagree, addresses don't.&lt;/strong&gt; The Lazarus Group / Ronin Bridge Exploiter case is the best two-word argument for entity resolution on blockchain data I've seen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-jurisdictional sanctioning is real and detectable.&lt;/strong&gt; Nine wallets — including a cluster of five ZEDCEX addresses — are sanctioned by both the US and Israel. You only see this if you reconcile multiple sanctions sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ER on multi-chain contract data surfaces universal infrastructure for free.&lt;/strong&gt; The top clusters are Safe, Permit2, and LayerZero — deployed at the same CREATE2 address across 30-45 chains each. Cluster size is the chain count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attackers verify source code at the long-tail baseline rate, not the production rate.&lt;/strong&gt; The useful signal is not "verified" but "verified &lt;em&gt;and&lt;/em&gt; independently tagged by multiple labelers."&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;

&lt;p&gt;Everything in this post is in a public repo: &lt;strong&gt;&lt;a href="https://github.com/benzsevern/goldenmatch-wallet-attribution" rel="noopener noreferrer"&gt;benzsevern/goldenmatch-wallet-attribution&lt;/a&gt;&lt;/strong&gt;. The full flow is four commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python fetch_public_data.py    &lt;span class="c"&gt;# ~2.5 GB download, ~10 min&lt;/span&gt;
python extract_ofac.py         &lt;span class="c"&gt;# parse SDN_ENHANCED.xml&lt;/span&gt;
python run_15m.py              &lt;span class="c"&gt;# stage 10 sources to common schema&lt;/span&gt;
python analyze_15m.py          &lt;span class="c"&gt;# cross-source cluster analysis&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All ten data sources are permissively licensed and redistributable. No API keys. No auth. The ~13M-row analysis finishes in about 3 minutes of wall-clock time on a laptop once the data is local.&lt;/p&gt;

&lt;p&gt;If you want to see what GoldenMatch looks like with its full feature set — fuzzy scoring, blocking strategies, lineage tracking, golden records — the earlier &lt;a href="https://github.com/benzsevern/goldenmatch-wallet-attribution/blob/main/archive/run_clusters.py" rel="noopener noreferrer"&gt;&lt;code&gt;archive/run_clusters.py&lt;/code&gt;&lt;/a&gt; runs it on the 535k-row subset end-to-end. That's the run that surfaced the Lazarus / Ronin Bridge case. Both scripts are preserved because they're answering different questions: &lt;em&gt;does the ER pipeline work?&lt;/em&gt; (yes, 535k with GoldenMatch) and &lt;em&gt;what falls out at the public-data ceiling?&lt;/em&gt; (the 13M run above).&lt;/p&gt;

&lt;p&gt;Install GoldenMatch: &lt;code&gt;pip install goldenmatch&lt;/code&gt;. Star the repo: &lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;benzsevern/goldenmatch&lt;/a&gt;. Try the live playground: &lt;a href="https://bensevern.dev/playground"&gt;bensevern.dev/playground&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Reproducibility footer.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source datasets:&lt;/strong&gt; OFAC SDN Enhanced XML (US Treasury), Sourcify parquet exports (export.sourcify.dev, manifest timestamp &lt;code&gt;2026-01-05T16:48:52Z&lt;/code&gt;), &lt;code&gt;brianleect/etherscan-labels&lt;/code&gt; (main), &lt;code&gt;dawsbot/eth-labels&lt;/code&gt; (master), &lt;code&gt;MyEtherWallet/ethereum-lists&lt;/code&gt; (master), &lt;code&gt;forta-network/labelled-datasets&lt;/code&gt; (main), &lt;code&gt;scamsniffer/scam-database&lt;/code&gt; (main), &lt;code&gt;api.llama.fi/protocols&lt;/code&gt;, OpenSanctions &lt;code&gt;us_fbi_lazarus_crypto&lt;/code&gt; and &lt;code&gt;il_mod_crypto&lt;/code&gt; (latest).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total download:&lt;/strong&gt; ~2.5 GB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input rows:&lt;/strong&gt; 13,147,920 across 10 sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique addresses:&lt;/strong&gt; 12,588,179.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-source clusters:&lt;/strong&gt; 30,958. Quadruple-confirmed: 301. Cross-sanctioned (US + Israel): 9.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;goldenmatch&lt;/code&gt; 1.4.4, &lt;code&gt;polars&lt;/code&gt; 1.39, Python 3.12.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; Windows laptop, 32 GB RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code and raw outputs:&lt;/strong&gt; &lt;a href="https://github.com/benzsevern/goldenmatch-wallet-attribution" rel="noopener noreferrer"&gt;benzsevern/goldenmatch-wallet-attribution&lt;/a&gt; (MIT). Scripts: &lt;code&gt;fetch_public_data.py&lt;/code&gt;, &lt;code&gt;extract_ofac.py&lt;/code&gt;, &lt;code&gt;run_15m.py&lt;/code&gt;, &lt;code&gt;analyze_15m.py&lt;/code&gt;, &lt;code&gt;analyze_malicious.py&lt;/code&gt;. Headline JSON: &lt;code&gt;output_15m/report.json&lt;/code&gt;. Cross-sanctioned records: &lt;code&gt;output_15m/cross_sanctioned.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data date:&lt;/strong&gt; 2026-04-09.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-09-wallet-attribution-13m-records" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>blockchain</category>
    </item>
    <item>
      <title>Hot take that the benchmark backs up: traditional OSS entity resolution trusts you, the user, to know what you're doing. On 50,000 rows of real healthcare data on a laptop, that trust is misplaced.

Full writeup, real numbers, honest disclaimers</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Wed, 08 Apr 2026 17:21:57 +0000</pubDate>
      <link>https://dev.to/benzsevern/hot-take-that-the-benchmark-backs-up-traditional-oss-entity-resolution-trusts-you-the-user-to-1e0i</link>
      <guid>https://dev.to/benzsevern/hot-take-that-the-benchmark-backs-up-traditional-oss-entity-resolution-trusts-you-the-user-to-1e0i</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1" class="crayons-story__hidden-navigation-link"&gt;The OSS ER Bargain: What Entity Resolution Actually Costs You&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/benzsevern" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" alt="benzsevern profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/benzsevern" class="crayons-story__secondary fw-medium m:hidden"&gt;
              benzsevern
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                benzsevern
                
              
              &lt;div id="story-author-preview-content-3472626" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/benzsevern" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;benzsevern&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 8&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1" id="article-link-3472626"&gt;
          The OSS ER Bargain: What Entity Resolution Actually Costs You
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/benchmarking"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;benchmarking&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            9 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>The OSS ER Bargain: What Entity Resolution Actually Costs You</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Wed, 08 Apr 2026 17:19:56 +0000</pubDate>
      <link>https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1</link>
      <guid>https://dev.to/benzsevern/the-oss-er-bargain-what-entity-resolution-actually-costs-you-8h1</guid>
      <description>&lt;h1&gt;
  
  
  The OSS ER Bargain: What Entity Resolution Actually Costs You
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Benchmarking &lt;code&gt;dedupe&lt;/code&gt; vs GoldenMatch on 500,000 CMS provider records&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The National Plan and Provider Enumeration System (NPPES) publishes one of the largest open healthcare directories in the world: 6+ million U.S. providers, updated monthly, with names spelled four different ways, addresses that drift across quarters, and enough Smiths and Garcias to keep any blocking algorithm honest. It's a reasonable stand-in for the kind of data most organizations actually have: real, messy, and big enough to hurt.&lt;/p&gt;

&lt;p&gt;I wanted to see what it costs to resolve a dataset like this with traditional open-source entity resolution, versus a holistic approach. So I took 500,000 randomly sampled records from the March 2026 NPPES release and pointed two tools at them: &lt;a href="https://github.com/dedupeio/dedupe" rel="noopener noreferrer"&gt;&lt;code&gt;dedupe&lt;/code&gt;&lt;/a&gt;, the canonical Python OSS deduper, and &lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;GoldenMatch&lt;/a&gt;, the matching engine at the heart of the Golden Suite.&lt;/p&gt;

&lt;p&gt;This isn't a precision/recall bake-off. NPPES ships no ground-truth duplicate labels, and I refused to inject synthetic ones — faking the test data to prove a point is cheating. What I measured instead is what it actually feels like to use each tool: wall-clock runtime, peak memory, how many decisions you have to make, and — critically — whether the tool can even finish the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OSS bargain
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;dedupe&lt;/code&gt; is, in many ways, the textbook open-source entity resolution library. It's well-documented, actively maintained, used in production at real companies, and its active-learning approach is genuinely clever: rather than make you write deterministic rules, it surfaces pairs of records it's uncertain about and asks you to label them.&lt;/p&gt;

&lt;p&gt;That cleverness has a cost, and the cost is you.&lt;/p&gt;

&lt;p&gt;Setting up &lt;code&gt;dedupe&lt;/code&gt; on NPPES means answering a sequence of questions the tool can't answer itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Which fields do you want to match on?&lt;/strong&gt; Pick wrong and your recall tanks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What types are they — &lt;code&gt;String&lt;/code&gt;, &lt;code&gt;Exact&lt;/code&gt;, &lt;code&gt;ShortString&lt;/code&gt;, &lt;code&gt;Price&lt;/code&gt;, &lt;code&gt;LatLong&lt;/code&gt;?&lt;/strong&gt; Each has different behavior and you need to know which.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How should it sample training pairs? What &lt;code&gt;sample_size&lt;/code&gt;? What &lt;code&gt;blocked_proportion&lt;/code&gt;?&lt;/strong&gt; These numbers shape what dedupe even sees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is your labeler honest?&lt;/strong&gt; Without ground truth, you're either clicking through uncertain pairs yourself, or — as I did here — writing a deterministic rule that labels pairs programmatically. Either way, you own the decision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What threshold do you partition at?&lt;/strong&gt; &lt;code&gt;0.5&lt;/code&gt;? &lt;code&gt;0.3&lt;/code&gt;? &lt;code&gt;0.7&lt;/code&gt;? The number is yours. &lt;code&gt;dedupe&lt;/code&gt; will not tell you which one is right for your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;index_predicates=True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;?&lt;/strong&gt; In dedupe 3.x, the "True" path needs an extra explicit indexing step or it crashes with &lt;code&gt;NoIndexError&lt;/code&gt; mid-partition. I found this out the hard way.&lt;/li&gt;
&lt;/ul&gt;
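&lt;p&gt;To make that decision surface concrete, here is the same list written down as explicit parameters. The dict shapes follow dedupe's variable-definition style; the values are illustrative, not recommendations. The point is how many of them are yours:&lt;/p&gt;

```python
# Every entry below is a decision dedupe leaves to the user.
variable_definitions = [
    {"field": "last_name",  "type": "String"},   # String? ShortString? You pick.
    {"field": "first_name", "type": "String"},
    {"field": "org_name",   "type": "String", "has missing": True},
    {"field": "zip",        "type": "Exact"},    # Exact vs String shifts recall.
]

training_params = {
    "sample_size": 15_000,      # shapes which pairs dedupe ever sees
    "blocked_proportion": 0.9,  # ditto
}

partition_threshold = 0.5  # 0.3? 0.7? The tool will not tell you.
index_predicates = False   # True needs an extra indexing step in 3.x.

decision_count = len(variable_definitions) + len(training_params) + 2
print(decision_count, "user-owned decisions before the first real run")
```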

&lt;p&gt;None of these questions have wrong answers in isolation. What they have in common is that every one of them is a decision the &lt;em&gt;user&lt;/em&gt; has to make, and every one of them silently changes the output of the algorithm downstream. &lt;code&gt;dedupe&lt;/code&gt; trusts you to know what you're doing. When you don't, you get quiet failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The holistic alternative
&lt;/h2&gt;

&lt;p&gt;GoldenMatch takes a different approach. You still write a config — I'm not going to pretend it's zero-configuration — but the config describes &lt;em&gt;what your data is&lt;/em&gt;, not &lt;em&gt;how dedupe should learn to resolve it&lt;/em&gt;. The blocking strategy, the scorers, the weight vectors, the clustering step, and the schema inference are all owned by the library. You point it at your polars DataFrame and call &lt;code&gt;dedupe_df&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's the whole GoldenMatch setup I used for NPPES:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soundex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[]),&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;substring:0:3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;max_block_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;skip_oversized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;matchkeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;MatchkeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weighted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaro_winkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaro_winkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaro_winkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;goldenmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That's the whole thing. Three blocking passes (phonetic surname, exact zip, organization prefix), six weighted field scorers, one threshold. No training loop. No uncertain-pair labeling. No "did I pick the right number of training pairs" anxiety.&lt;/p&gt;
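&lt;p&gt;For intuition about what a blocking pass does, here is a minimal stdlib-only sketch. The soundex below is a simplified version of the standard algorithm (it skips the H/W adjacency rule), and the records are made up; it illustrates the idea, not GoldenMatch's internals:&lt;/p&gt;

```python
from collections import defaultdict

def soundex(name: str) -> str:
    """Simplified soundex (assumes a non-empty name): keep the first
    letter, encode consonants as digits, collapse repeats, pad to 4."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

# Hypothetical rows; near-duplicate surnames land in the same block.
records = [
    {"last_name": "Nehrebecki", "zip": "94110", "org_name": "Bay Area Health"},
    {"last_name": "Nerebecky",  "zip": "94110", "org_name": "Bay Area Health Svc"},
]

# One bucket per blocking key; only records sharing a key are compared.
blocks = defaultdict(list)
for i, r in enumerate(records):
    blocks[("soundex_last", soundex(r["last_name"]))].append(i)
    blocks[("zip", r["zip"])].append(i)
    blocks[("org_prefix", r["org_name"][:3].lower())].append(i)
```

&lt;p&gt;The payoff is that a typo'd surname still shares a phonetic key, so the pair survives blocking and reaches the scorers.&lt;/p&gt;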
&lt;h2&gt;
  
  
  What happened at 50,000 rows
&lt;/h2&gt;

&lt;p&gt;I ran both tools on a 50,000-row slice of the NPPES sample:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;&lt;code&gt;dedupe&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;GoldenMatch&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wall-clock runtime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3,589 s (59.8 min)&lt;/td&gt;
&lt;td&gt;17.3 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;207×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Peak process RSS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8,699 MB&lt;/td&gt;
&lt;td&gt;602 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14×&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-record clusters found&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2,857&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config lines&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;206&lt;/td&gt;
&lt;td&gt;148&lt;/td&gt;
&lt;td&gt;1.4×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Human decisions required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8+ (see list above)&lt;/td&gt;
&lt;td&gt;3 (blocking, scorers, threshold)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The runtime and memory numbers are jaw-dropping on their own. But look at the "multi-record clusters found" row. &lt;code&gt;dedupe&lt;/code&gt; returned &lt;strong&gt;zero&lt;/strong&gt; clusters with more than one record. It produced 50,000 singletons — a perfectly unhelpful partition that says every record is its own entity.&lt;/p&gt;

&lt;p&gt;This is not because NPPES has no duplicates. GoldenMatch found 2,857 multi-record clusters on the same data: real matches like &lt;code&gt;PETER ROBERT NEHREBECKI&lt;/code&gt; at &lt;code&gt;240 SHOTWELL ST STE 206&lt;/code&gt; appearing twice under different NPIs, or organizational providers sharing an address and a taxonomy code. The duplicates are there. &lt;code&gt;dedupe&lt;/code&gt; just couldn't see them.&lt;/p&gt;

&lt;p&gt;Why not? Because &lt;code&gt;dedupe&lt;/code&gt;'s classifier needs balanced positive and negative training pairs, and the deterministic rule oracle I fed it (match iff same NPI, or same normalized &lt;code&gt;last_name + first_name + zip5&lt;/code&gt;) rarely triggers in a random 50k slice of NPPES. Without enough positives, the classifier collapses to "everything is distinct," sklearn warns "only one class in y," and you wait an hour for an output that says nothing.&lt;/p&gt;
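&lt;p&gt;To make the protocol concrete, here is a sketch of that deterministic rule oracle. Field names follow the benchmark schema from the footer; the NPIs and rows are invented for illustration:&lt;/p&gt;

```python
import itertools

def normalize(s):
    return " ".join(s.lower().split())

def rule_oracle(a, b):
    """Label a pair as a match iff same NPI, or same normalized
    last_name + first_name + zip5 (the strict protocol from the post)."""
    if a["npi"] == b["npi"]:
        return True
    key = lambda r: (normalize(r["last_name"]),
                     normalize(r["first_name"]),
                     r["zip"][:5])
    return key(a) == key(b)

records = [
    {"npi": "1", "last_name": "Nehrebecki", "first_name": "Peter", "zip": "94110-1234"},
    {"npi": "2", "last_name": "NEHREBECKI", "first_name": "peter", "zip": "94110-9999"},
    {"npi": "3", "last_name": "Smith", "first_name": "Jane", "zip": "10001"},
]
labels = [(i, j, rule_oracle(a, b))
          for (i, a), (j, b) in itertools.combinations(enumerate(records), 2)]
```

&lt;p&gt;On a random slice of real data, almost every pair this oracle sees is a negative, which is exactly how the classifier ends up with one class in &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;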

&lt;p&gt;Could I fix this? Yes. I could loosen the rule oracle, pre-seed with softer matches, hand-label pairs, or try a different classifier. All of those are more decisions I'd have to make — decisions that &lt;code&gt;dedupe&lt;/code&gt;'s design says are mine to own. I ran it honestly, with a clearly documented protocol, and an honest result is what I got.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling out: does GoldenMatch survive 500,000?
&lt;/h2&gt;

&lt;p&gt;Having established that &lt;code&gt;dedupe&lt;/code&gt; is not going to finish NPPES at any interesting scale on a laptop, I ran GoldenMatch up the ladder.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;GoldenMatch runtime&lt;/th&gt;
&lt;th&gt;Peak RSS&lt;/th&gt;
&lt;th&gt;Multi-record clusters&lt;/th&gt;
&lt;th&gt;Records collapsed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;17.3 s&lt;/td&gt;
&lt;td&gt;602 MB&lt;/td&gt;
&lt;td&gt;2,857&lt;/td&gt;
&lt;td&gt;2,857&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;47.0 s&lt;/td&gt;
&lt;td&gt;731 MB&lt;/td&gt;
&lt;td&gt;9,511&lt;/td&gt;
&lt;td&gt;9,511&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;td&gt;261.0 s&lt;/td&gt;
&lt;td&gt;2,150 MB&lt;/td&gt;
&lt;td&gt;120,191&lt;/td&gt;
&lt;td&gt;120,191&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Ten times the data, fifteen times the runtime, four times the memory, and roughly forty times the duplicates found. Super-linear growth in cluster count — unsurprising, since larger datasets surface more duplicate pairs per row. The 500k run finished in 4 minutes 21 seconds using 2.1 GB of RAM on a Windows laptop. Whatever &lt;code&gt;dedupe&lt;/code&gt; was doing with its 8.7 GB and its hour of CPU at 50k, GoldenMatch was doing 10× the work in a small fraction of the time and a quarter of the memory.&lt;/p&gt;
&lt;h2&gt;
  
  
  What the sensitivity analysis actually shows
&lt;/h2&gt;

&lt;p&gt;I also swept GoldenMatch through 5 config variations at 50k — four threshold values (0.65, 0.70, 0.80, 0.85) plus a stricter weight preset — and measured Adjusted Rand Index against the default run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variant&lt;/th&gt;
&lt;th&gt;ARI vs default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;threshold=0.65&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.5044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;threshold=0.70&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.7299&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;threshold=0.80&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.4716&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;threshold=0.85&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.2821&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preset_strict&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.8505&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's what I want to flag honestly: &lt;strong&gt;GoldenMatch's output is sensitive to threshold&lt;/strong&gt;. The ARI range across variants is 0.57 — that's a lot of movement. If your only claim were "holistic ER is stable under config changes," this table would undermine you.&lt;/p&gt;

&lt;p&gt;I don't think that's the right claim.&lt;/p&gt;

&lt;p&gt;The right claim is: &lt;strong&gt;the knobs work&lt;/strong&gt;. When you tighten the threshold from 0.65 to 0.85, GoldenMatch produces noticeably stricter clusters — exactly as you'd expect. The threshold is a real, functional control surface, not a cosmetic dial. A sensitivity of 0.57 ARI means the tool actually does different things when you ask it to.&lt;/p&gt;

&lt;p&gt;And — here's the uncomfortable counterpart — I cannot compare this to dedupe's sensitivity, because dedupe at 50k produces all-singletons at every threshold. Dedupe's "sensitivity" is 0.0 because the output is trivially constant: nothing, nothing, nothing, nothing. Perfect stability, zero utility.&lt;/p&gt;

&lt;p&gt;That's the shape of the real comparison. One tool has knobs that work on a job it can actually finish. The other tool's knobs don't matter because it never got to a meaningful output in the first place.&lt;/p&gt;
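&lt;p&gt;For reference, ARI can be computed straight from the pair-counting contingency table. A stdlib-only sketch with a toy example (sklearn's &lt;code&gt;adjusted_rand_score&lt;/code&gt; computes the same quantity):&lt;/p&gt;

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI via pair counting: how often two clusterings agree on which
    record pairs belong together, corrected for chance agreement."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    denom = max_index - expected
    # Convention: identical degenerate partitions score 1.0
    return 1.0 if denom == 0 else (sum_ij - expected) / denom

# Toy example: a stricter threshold splits one cluster apart.
default_run = [0, 0, 1, 1, 2, 3]
strict_run = [0, 1, 2, 2, 3, 4]
ari = adjusted_rand_index(default_run, strict_run)
```

&lt;p&gt;ARI is 1.0 for identical partitions and near 0 for chance-level agreement, which is why a 0.57 spread across threshold variants counts as real movement.&lt;/p&gt;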
&lt;h2&gt;
  
  
  What "holistic" actually means
&lt;/h2&gt;

&lt;p&gt;When I say GoldenMatch's approach is holistic, I do not mean "it hides the hard decisions from you." Clearly it doesn't — the threshold matters, the blocking choices matter, the scorer weights matter. You can see every one of them in the config block above.&lt;/p&gt;

&lt;p&gt;What I mean is that GoldenMatch &lt;strong&gt;owns the decisions the user shouldn't have to own&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether to build an index over blocking predicates, and when to release it. &lt;code&gt;dedupe&lt;/code&gt; makes this your problem and crashes if you guess wrong.&lt;/li&gt;
&lt;li&gt;Whether to fall back to a lookup table when a block grows oversized. &lt;code&gt;dedupe&lt;/code&gt; blows your memory budget before you notice.&lt;/li&gt;
&lt;li&gt;How to assemble per-field scores into a cluster decision, and how to verify that decision across the transitive closure of pairs. &lt;code&gt;dedupe&lt;/code&gt; leaves this to a classifier whose training data you have to provide.&lt;/li&gt;
&lt;li&gt;How to handle the case where your labeled training set has no positives. &lt;code&gt;dedupe&lt;/code&gt; collapses silently. GoldenMatch doesn't need labels.&lt;/li&gt;
&lt;/ul&gt;
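&lt;p&gt;The transitive-closure step in that third bullet is the classic union-find construction. A minimal sketch, with a hypothetical edge list standing in for the pairs whose weighted score cleared the threshold:&lt;/p&gt;

```python
class UnionFind:
    """Merge matched pairs into clusters via union-find (path halving)."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

# Hypothetical: pairs whose weighted score cleared the threshold.
matches = [(0, 1), (1, 2), (4, 5)]
uf = UnionFind(6)
for a, b in matches:
    uf.union(a, b)

# Group records by their cluster root.
clusters = {}
for i in range(6):
    clusters.setdefault(uf.find(i), []).append(i)
```

&lt;p&gt;Note that records 0 and 2 end up clustered without ever being scored against each other, which is why verifying decisions across the closure (not just per pair) matters.&lt;/p&gt;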

&lt;p&gt;The OSS bargain is: the library gives you flexibility, and the cost is that you own the consequences of every degree of freedom it exposes. That's fine for small datasets, clean schemas, and practitioners who already know what they're doing. On 500,000 rows of real NPPES data on a laptop, it's not a bargain — it's a trap.&lt;/p&gt;
&lt;h2&gt;
  
  
  The disclaimers
&lt;/h2&gt;

&lt;p&gt;I want to be precise about what this benchmark is and isn't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No ground truth.&lt;/strong&gt; NPPES doesn't ship duplicate labels, and I didn't inject synthetic ones. Every "duplicates found" number is what each tool reports, not what is objectively correct. Some of GoldenMatch's 2,857 clusters at 50k are probably wrong. Without ground truth, I can't tell you the precision or recall of either tool. What I &lt;em&gt;can&lt;/em&gt; tell you is that 0 is not the right answer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedupe's labeling protocol matters a lot.&lt;/strong&gt; I used a deterministic rule (NPI equality OR normalized &lt;code&gt;last_name + first_name + zip5&lt;/code&gt; equality) to label pairs for dedupe. A different protocol — a hand-labeled training set, or a looser rule — would likely give dedupe a fighting chance to learn a real classifier. My protocol is strict on purpose: it's the kind of thing a data engineer would actually write when they need a reproducible pipeline without human-in-the-loop labeling. If your protocol is softer, your results will differ.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory numbers include the Python interpreter and loaded libraries.&lt;/strong&gt; Peak RSS is measured via &lt;code&gt;psutil.Process().memory_info().rss&lt;/code&gt; sampled every 500ms in a background thread. Both tools share the same baseline, so the comparison is fair, but don't read "8,699 MB" as "what dedupe's data structures allocated" — read it as "what the process was holding at its peak."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GoldenMatch benefits from recent memory-management work.&lt;/strong&gt; The Golden Suite has had explicit OOM-prevention work over the last several months. Dedupe doesn't. That asymmetry is real, and I'm not pretending it isn't. If you ran this on dedupe's preferred architecture (e.g., with Postgres-backed storage via &lt;code&gt;dedupe-examples&lt;/code&gt;), the memory number would improve — at the cost of adding Postgres to your workflow, which is yet another decision you'd have to make.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;dedupe&lt;/code&gt; is an excellent tool in its lane.&lt;/strong&gt; I'm not here to bury it. On small, labeled datasets with an engaged human, it does exactly what it says on the tin. The point of this post is that "small, labeled, with an engaged human" is a much narrower lane than it looks, and lots of real-world ER problems fall outside it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
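&lt;p&gt;The RSS measurement described in the third disclaimer can be reproduced with a small sampler like this. &lt;code&gt;psutil.Process().memory_info().rss&lt;/code&gt; is the real API; the wrapper class is my sketch, not the benchmark's actual harness:&lt;/p&gt;

```python
import threading
import time

import psutil  # third-party; pip install psutil

class PeakRssSampler:
    """Sample this process's RSS in a background thread; keep the max seen."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        proc = psutil.Process()
        while not self._stop.is_set():
            self.peak = max(self.peak, proc.memory_info().rss)
            self._stop.wait(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

&lt;p&gt;Usage is &lt;code&gt;with PeakRssSampler() as s: run_job()&lt;/code&gt;, then read &lt;code&gt;s.peak&lt;/code&gt; in bytes — which, as noted above, includes the interpreter and every loaded library, not just the tool's own data structures.&lt;/p&gt;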
&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you take nothing else from this post, take this: &lt;strong&gt;the cost of an entity resolution tool is not the license fee, it's the number of decisions the tool hands back to you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dedupe&lt;/code&gt; hands you the field types, the blocking predicates, the sample size, the training labels, the classifier choice, the index strategy, the threshold, and the prayer that it all adds up to something useful. At 50,000 rows of NPPES on my laptop, it did not.&lt;/p&gt;

&lt;p&gt;GoldenMatch hands you a config, runs, and tells you the answer. The answer is opinionated — the threshold matters, the weights matter — but the tool finishes the job, and the job at scale is the job that actually matters.&lt;/p&gt;

&lt;p&gt;Your mileage will vary. Your data is not NPPES. Your hardware is not my laptop. Your labeling protocol is not my labeling protocol. But the next time you're evaluating an ER tool, don't just ask "what accuracy does it reach?" — ask "on my data, at my scale, with the time I have, does it finish?"&lt;/p&gt;

&lt;p&gt;For NPPES on a laptop, the answer to that question is already decided.&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Reproducibility footer.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Source data:&lt;/strong&gt; NPPES Full Replacement Monthly NPI File, March 2026 (V2) release.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL:&lt;/strong&gt; &lt;code&gt;https://download.cms.gov/nppes/NPPES_Data_Dissemination_March_2026_V2.zip&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downloaded:&lt;/strong&gt; 2026-04-08T15:01:58Z&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zip SHA-256:&lt;/strong&gt; &lt;code&gt;34ba67637c69bc72dfe48f28625d3988550c679fdbc95786af543228912cb463&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sample:&lt;/strong&gt; 500,000 rows via streaming reservoir sample (seed=42), columns pinned to &lt;code&gt;npi, entity_type, org_name, last_name, first_name, middle_name, address, city, state, zip, taxonomy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;dedupe&lt;/code&gt; (3.x), &lt;code&gt;goldenmatch&lt;/code&gt; 1.4.3, Python 3.12.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware:&lt;/strong&gt; Windows laptop, 32 GB RAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code:&lt;/strong&gt; &lt;code&gt;comparison_bench/&lt;/code&gt; in the &lt;code&gt;golden-showcase&lt;/code&gt; repo. Scripts: &lt;code&gt;data_prep.py&lt;/code&gt;, &lt;code&gt;run_dedupe_nppes.py&lt;/code&gt;, &lt;code&gt;run_goldenmatch_nppes.py&lt;/code&gt;, &lt;code&gt;feasibility_probe_nppes.py&lt;/code&gt;, &lt;code&gt;bench_utils.py&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw results:&lt;/strong&gt; &lt;code&gt;results_dedupe_nppes.json&lt;/code&gt;, &lt;code&gt;results_goldenmatch_nppes.json&lt;/code&gt;, &lt;code&gt;results_feasibility_nppes.json&lt;/code&gt;, plus per-run cluster sidecars in &lt;code&gt;comparison_bench/clusters/&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
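&lt;p&gt;The streaming reservoir sample in the footer is Algorithm R; a sketch of just the sampling step, with CSV parsing and column pinning omitted:&lt;/p&gt;

```python
import random

def reservoir_sample(rows, k, seed=42):
    """Uniform k-row sample from a stream: fill the reservoir with the
    first k rows, then let row i replace a kept row with probability
    k/(i+1), so the full file never has to fit in memory."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if k > i:
            sample.append(row)
        else:
            j = rng.randint(0, i)
            if k > j:
                sample[j] = row
    return sample
```

&lt;p&gt;Seeding the generator (here with 42, as in the benchmark) makes the sample reproducible across runs of the same file.&lt;/p&gt;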


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://bensevern.dev/" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;bensevern.dev&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/benzsevern" rel="noopener noreferrer"&gt;
        benzsevern
      &lt;/a&gt; / &lt;a href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;
        goldenmatch
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Entity resolution and deduplication toolkit — outperforms Splink, dedupe, and RecordLinkage on cross-domain benchmarks. Zero-config. MST cluster auto-splitting. Quality-weighted survivorship. 30 MCP tools on Smithery. 10 A2A skills. 97.2% F1 on DBLP-ACM.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;GoldenMatch&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Find duplicate records in 30 seconds. No rules to write, no models to train.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/benzsevern/goldenmatch/docs/screenshots/demo.svg"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fbenzsevern%2Fgoldenmatch%2FHEAD%2Fdocs%2Fscreenshots%2Fdemo.svg" alt="GoldenMatch Demo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install goldenmatch
goldenmatch dedupe customers.csv&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://pypi.org/project/goldenmatch/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/182c7ae0464a92d03d793883b0a540d4bd4a0b2f93acc815af226e48f3b0b871/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f676f6c64656e6d617463683f636f6c6f723d643461303137" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/goldenmatch/actions/workflows/ci.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/benzsevern/goldenmatch/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/benzsevern/goldenmatch" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/c38c8163e2d18b85dce1b257fbe91c1a02c08c438a2622331a82c42b6d872749/68747470733a2f2f636f6465636f762e696f2f67682f62656e7a73657665726e2f676f6c64656e6d617463682f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a href="https://pepy.tech/project/goldenmatch" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/493c6ca1dd1feaa3534896866bde4f14753e897b64998ca1cf3aa50910f504c8/68747470733a2f2f7374617469632e706570792e746563682f62616467652f676f6c64656e6d617463682f6d6f6e7468" alt="Downloads"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/96abf9b704f80578ea56dd10cab0d911c56d46dbec347f431ece9cf60ac175ad/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31312532422d626c7565" alt="Python 3.11+"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/goldenmatch/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f8df3091bbe1149f398a5369b2c39e896766f9f6efba3477c63e9b4aa940ef14/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d677265656e" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://benzsevern.github.io/goldenmatch/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/040505e179a1ceb6cba80f6430eaa225f3b80c4bc0f5bdfe115410453ad35bf8/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d62656e7a73657665726e2e6769746875622e696f253246676f6c64656e6d617463682d643461303137" alt="Docs"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/dqbench" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0022e7da8518f5c816a5f0b17831e77e0289f7da8f712f4ad83d4d2388ac3480/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f445142656e636825323045522d39352e33302d676f6c64" alt="DQBench ER"&gt;&lt;/a&gt;
&lt;a href="https://colab.research.google.com/github/benzsevern/goldenmatch/blob/main/scripts/gpu_colab_notebook.ipynb" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/eff96fda6b2e0fff8cdf2978f89d61aa434bb98c00453ae23dd0aab8d1451633/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Why GoldenMatch?&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config&lt;/strong&gt; — auto-detects columns, picks scorers, and runs. No training data needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;97.2% F1&lt;/strong&gt; on DBLP-ACM out of the box. &lt;a href="https://github.com/benzsevern/dqbench" rel="noopener noreferrer"&gt;DQBench ER score: 95.30&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy-preserving&lt;/strong&gt; — match across organizations without sharing raw data (PPRL, 92.4% F1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30 MCP tools&lt;/strong&gt; — use from Claude Desktop, Claude Code, or any AI assistant (&lt;a href="https://smithery.ai/servers/benzsevern/goldenmatch" rel="nofollow noopener noreferrer"&gt;Smithery&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-ready&lt;/strong&gt; — Postgres sync, daemon mode, lineage tracking, review queues&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Choose your path&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;I want to...&lt;/th&gt;
&lt;th&gt;Go here&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deduplicate a CSV right now&lt;/td&gt;
&lt;td&gt;&lt;a href="https://benzsevern.github.io/goldenmatch/quick-start" rel="nofollow noopener noreferrer"&gt;Quick Start&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use from Claude Desktop / AI assistant&lt;/td&gt;
&lt;td&gt;&lt;a href="https://benzsevern.github.io/goldenmatch/mcp" rel="nofollow noopener noreferrer"&gt;MCP Server&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build AI agents that deduplicate&lt;/td&gt;
&lt;td&gt;&lt;a href="https://benzsevern.github.io/goldenmatch/agent" rel="nofollow noopener noreferrer"&gt;ER Agent (A2A)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Python code&lt;/td&gt;
&lt;td&gt;&lt;a href="https://benzsevern.github.io/goldenmatch/python-api" rel="nofollow noopener noreferrer"&gt;Python API&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use the interactive TUI&lt;/td&gt;
&lt;td&gt;&lt;a href="https://benzsevern.github.io/goldenmatch/tui" rel="nofollow noopener noreferrer"&gt;TUI Guide&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;&lt;/p&gt;

&lt;strong&gt;All features&lt;/strong&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Matching&lt;/h3&gt;

&lt;/div&gt;


&lt;ul&gt;

&lt;li&gt;

&lt;strong&gt;10+ scoring methods&lt;/strong&gt; — exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding, dice…&lt;/li&gt;

&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/benzsevern/goldenmatch" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>MCPs enabling data cleaning and deduping.</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:27:49 +0000</pubDate>
      <link>https://dev.to/benzsevern/mcps-enabling-data-cleaning-and-deduping-5d8b</link>
      <guid>https://dev.to/benzsevern/mcps-enabling-data-cleaning-and-deduping-5d8b</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1" class="crayons-story__hidden-navigation-link"&gt;Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/benzsevern" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" alt="benzsevern profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/benzsevern" class="crayons-story__secondary fw-medium m:hidden"&gt;
              benzsevern
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                benzsevern
                
              
              &lt;div id="story-author-preview-content-3467276" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/benzsevern" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;benzsevern&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 7&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1" id="article-link-3467276"&gt;
          Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/mcp"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;mcp&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/aiagents"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;aiagents&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            5 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:02:45 +0000</pubDate>
      <link>https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1</link>
      <guid>https://dev.to/benzsevern/golden-suite-mcp-giving-ai-agents-a-data-cleaning-toolkit-23k1</guid>
      <description>&lt;p&gt;An AI agent can write SQL, draft an email, and refactor a repo. Ask it to deduplicate a 50,000-row customer file and it will cheerfully hand you a &lt;code&gt;pandas.drop_duplicates()&lt;/code&gt; one-liner that finds zero matches. The model knows the concept. It does not know your data, and it has no tool that actually solves entity resolution.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) is the missing wire. It lets a host like Claude Code, Cursor, or any agent runtime call real tools running on your machine — with real schemas, real parameters, and real results. Golden Suite was built as a set of composable Python packages from day one, which makes it a near-perfect fit. This post walks through how we expose Golden Suite over MCP, what that unlocks for AI workflows, and where the roadmap goes from here.&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP actually is
&lt;/h2&gt;

&lt;p&gt;MCP is a thin JSON-RPC protocol that standardises three things between an AI host and an external server:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — typed functions the model can call (&lt;code&gt;goldenmatch.dedupe&lt;/code&gt;, &lt;code&gt;infermap.map_schema&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; — readable artifacts the model can pull into context (a sample of a CSV, a profiling report)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; — pre-baked instruction templates the host can offer the user&lt;/li&gt;
&lt;/ul&gt;
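&lt;p&gt;On the wire, a tool invocation is a plain JSON-RPC request from the host to the server. A sketch of what that looks like for a hypothetical dedupe tool; the tool name and arguments are illustrative:&lt;/p&gt;

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "goldenmatch_dedupe",
    "arguments": { "path": "customers.csv", "threshold": 0.85 }
  }
}
```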

&lt;p&gt;The host handles the LLM. The server handles the work. The contract between them is a stable schema, which means the same Golden Suite MCP server works in Claude Desktop, Cursor, Continue, or a custom Agent SDK app — without rewriting any glue code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Golden Suite fits
&lt;/h2&gt;

&lt;p&gt;Each Golden Suite package is already a small, well-typed Python API:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Package&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Natural MCP tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;infermap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schema mapping between source and target&lt;/td&gt;
&lt;td&gt;&lt;code&gt;map_schema(source, target)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoldenCheck&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Profiling and data quality scanning&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;profile(path)&lt;/code&gt;, &lt;code&gt;quality_report(path)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoldenFlow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto-transformation of messy values&lt;/td&gt;
&lt;td&gt;&lt;code&gt;clean(path, rules?)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoldenMatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entity resolution and deduplication&lt;/td&gt;
&lt;td&gt;&lt;code&gt;dedupe(path, config?)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoldenPipe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Orchestrates the full pipeline&lt;/td&gt;
&lt;td&gt;&lt;code&gt;run_pipeline(path)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Wrapping these as MCP tools is mostly metadata — the underlying functions already accept paths, return structured results, and stream progress. A minimal server looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dedupe&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;infermap&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;map_schema&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenpipe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_pipeline&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;golden-suite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;goldenmatch_dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deduplicate a CSV using fuzzy entity resolution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clusters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;golden_records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;golden_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infermap_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_csv&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Map columns from a source CSV to a target schema.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;map_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;goldenpipe_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run profile → clean → dedupe in one shot.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop that into a Claude Desktop config and the model now has hands.&lt;/p&gt;
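&lt;p&gt;Concretely, registering the server is one entry in &lt;code&gt;claude_desktop_config.json&lt;/code&gt;; the script path below is a placeholder for wherever you saved the file above:&lt;/p&gt;

```json
{
  "mcpServers": {
    "golden-suite": {
      "command": "python",
      "args": ["/absolute/path/to/golden_suite_server.py"]
    }
  }
}
```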

&lt;h2&gt;
  
  
  What this actually unlocks
&lt;/h2&gt;

&lt;p&gt;The interesting part is not "Claude can call dedupe." It is what happens when a planning model can chain these tools against real files in a single conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Conversational data cleaning
&lt;/h3&gt;

&lt;p&gt;A user drags a CSV into Claude and says "make this usable." The agent calls &lt;code&gt;goldencheck_profile&lt;/code&gt;, sees 18% missing zip codes and three different date formats, calls &lt;code&gt;goldenflow_clean&lt;/code&gt;, then &lt;code&gt;goldenmatch_dedupe&lt;/code&gt;, and reports back: &lt;em&gt;"5,426 rows in, 4,891 golden records out, 535 fuzzy duplicates merged. Here are 12 clusters that look low-confidence — want to review them?"&lt;/em&gt; No code written, no docs read.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Schema mapping inside an ETL agent
&lt;/h3&gt;

&lt;p&gt;Today, mapping a vendor's &lt;code&gt;cust_id&lt;/code&gt; to your &lt;code&gt;customer_id&lt;/code&gt; is a human-in-the-loop chore. With infermap exposed over MCP, an agent building an ingestion pipeline can call &lt;code&gt;infermap_map(source, target)&lt;/code&gt;, get a confidence-scored mapping, and only ask the human about the columns it isn't sure about. The boring 80% disappears.&lt;/p&gt;
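&lt;p&gt;That triage loop is simple to express. A sketch over a confidence-scored mapping; the &lt;code&gt;{source: (target, confidence)}&lt;/code&gt; shape is an assumption for illustration, not infermap's actual return format:&lt;/p&gt;

```python
def triage(mapping: dict[str, tuple[str, float]], cutoff: float = 0.9):
    """Split a {source_col: (target_col, confidence)} mapping into
    auto-accepted pairs and columns to escalate to a human."""
    accepted, review = {}, []
    for src, (tgt, conf) in mapping.items():
        if conf >= cutoff:
            accepted[src] = tgt
        else:
            review.append(src)
    return accepted, review

mapping = {
    "cust_id": ("customer_id", 0.98),
    "fname": ("first_name", 0.95),
    "seg_cd": ("segment", 0.55),   # ambiguous — ask the human about this one
}
accepted, review = triage(mapping)
print(review)  # ['seg_cd']
```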

&lt;h3&gt;
  
  
  3. Reverse ETL where the AI is the ETL
&lt;/h3&gt;

&lt;p&gt;Once the agent can both &lt;em&gt;map&lt;/em&gt; and &lt;em&gt;match&lt;/em&gt;, it can take an arbitrary file and merge it into an existing identity store without a pre-written job. That is the underlying bet behind Golden Suite — an autonomous identity layer — and MCP is the surface that lets the agent reach it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Honest accuracy reporting
&lt;/h3&gt;

&lt;p&gt;Because the tools return structured results (cluster counts, match rates, confidence histograms), the model can quote real numbers instead of inventing them. When an agent says "I deduplicated this," you can verify the claim against the tool output. That is a much better story than "trust the LLM."&lt;/p&gt;

&lt;h2&gt;
  
  
  What stays hard
&lt;/h2&gt;

&lt;p&gt;MCP does not solve everything. A few things still need care:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits.&lt;/strong&gt; Running a 400,000-row dedup inside an interactive chat session is a great way to OOM your laptop. The server has to enforce row limits or stream to a background job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth and scoping.&lt;/strong&gt; An agent with a &lt;code&gt;dedupe(path)&lt;/code&gt; tool can read any file the server can read. Path allowlists matter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Determinism.&lt;/strong&gt; LLM-boost paths use embeddings and an LLM tiebreaker — runs need to be reproducible enough that "the agent did it" is auditable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility.&lt;/strong&gt; When the agent triggers a paid LLM-boost step, the user should see it before it happens, not after.&lt;/li&gt;
&lt;/ul&gt;
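&lt;p&gt;The first two items can be enforced in the tool wrapper before any matching runs. A minimal sketch; the row cap and allowlist values are illustrative, not Golden Suite defaults:&lt;/p&gt;

```python
from pathlib import Path

MAX_ROWS = 100_000                       # illustrative interactive cap
ALLOWED_DIRS = [Path("/data/exports")]   # directories the agent may read from

def guard(path_str: str) -> Path:
    """Reject paths outside the allowlist and files too big to dedupe in-chat."""
    path = Path(path_str).resolve()
    if not any(path.is_relative_to(d.resolve()) for d in ALLOWED_DIRS):
        raise PermissionError(f"{path} is outside the allowed directories")
    # Cheap row estimate: count newlines without loading the file into memory.
    with path.open("rb") as f:
        rows = sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))
    if rows > MAX_ROWS:
        raise ValueError(f"{rows} rows exceeds the {MAX_ROWS:,} interactive limit")
    return path
```

A tool like `goldenmatch_dedupe` would call `guard(path)` first and surface the error back to the agent instead of starting the job.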

&lt;p&gt;None of these are MCP problems specifically — they are the same problems any agent-callable tool has — but they shape how the server gets built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future direction
&lt;/h2&gt;

&lt;p&gt;The MCP server is the front door. The interesting roadmap is what sits behind it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Identity Store as a resource.&lt;/strong&gt; Once the persistent identity store lands, MCP exposes it as a resource the agent can read from and write to. An agent ingesting a new file does not just dedupe within the file — it merges into the canonical store and gets back stable IDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversational correction.&lt;/strong&gt; The paid Golden Suite features are built around correcting the model's mistakes in natural language ("merge these two clusters", "split this one"). MCP makes this a first-class loop: the agent surfaces low-confidence clusters as a prompt, the user corrects them in chat, and the corrections feed back into the matcher's learned config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion connectors as MCP tools.&lt;/strong&gt; The Phase 2 ingestion layer (warehouses, databases, SaaS APIs) becomes a family of MCP tools — &lt;code&gt;snowflake_pull&lt;/code&gt;, &lt;code&gt;salesforce_pull&lt;/code&gt;, &lt;code&gt;postgres_pull&lt;/code&gt; — that hand data straight into the existing pipeline tools. The agent can then say "pull yesterday's leads from Salesforce, dedupe against the identity store, and push the merges back." End-to-end, with no glue code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent pipelines.&lt;/strong&gt; Once each step is an MCP tool, you can run a planner agent that decomposes a high-level goal ("clean and merge all of Q1's vendor files") into parallel sub-agents, each calling the same Golden Suite server. The server becomes the shared substrate; the agents become disposable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public hosted MCP endpoint.&lt;/strong&gt; Long-term, a hosted Golden Suite MCP server means you don't have to install anything — point your agent host at a URL, authenticate, and you have a data cleaning toolkit. That is the Golden Suite product surface in one sentence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MCP is the standard wire between AI hosts and real tools — it removes the per-host glue code.&lt;/li&gt;
&lt;li&gt;Golden Suite's package boundaries map almost one-to-one onto MCP tools.&lt;/li&gt;
&lt;li&gt;The unlock is not "Claude can call dedupe" — it is conversational, end-to-end data cleaning where the agent chains profile → clean → match → merge against real data.&lt;/li&gt;
&lt;li&gt;The roadmap points at the identity store, conversational correction, ingestion connectors, and a hosted endpoint — each one extends what an agent can do without leaving the chat.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;Golden Suite is on PyPI today — &lt;code&gt;pip install goldenmatch&lt;/code&gt;, &lt;code&gt;pip install infermap&lt;/code&gt;, &lt;code&gt;pip install goldenpipe&lt;/code&gt;. The MCP server wrapper is the thinnest layer on top, and if you want to point Claude Desktop or Cursor at a local Golden Suite install, the snippet above is a working starting point. Star the &lt;a href="https://github.com/bsevern" rel="noopener noreferrer"&gt;repo&lt;/a&gt;, try the live tools in the &lt;a href="https://dev.to/playground"&gt;playground&lt;/a&gt;, and if you want an MCP-first workflow shipped sooner rather than later — let me know what you would call first.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-07-golden-suite-mcp-servers" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>mcp</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>From Dirty CSV to Golden Records: A Python Walkthrough</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:02:43 +0000</pubDate>
      <link>https://dev.to/benzsevern/from-dirty-csv-to-golden-records-a-python-walkthrough-19p7</link>
      <guid>https://dev.to/benzsevern/from-dirty-csv-to-golden-records-a-python-walkthrough-19p7</guid>
      <description>&lt;p&gt;Download a government CSV, load it into pandas, and you'll find "MEMORIAL HOSPITAL" listed twelve times across six states. Run &lt;code&gt;drop_duplicates()&lt;/code&gt; — it finds zero exact copies. Try deduplicating on facility name alone — it merges hospitals that are genuinely different. Data cleaning and deduplication in Python requires more than one-liners. It requires a coordinated pipeline that profiles, cleans, and matches records in sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Dataset&lt;/li&gt;
&lt;li&gt;Why drop_duplicates() Fails on Real Data&lt;/li&gt;
&lt;li&gt;Zero-Config Data Cleaning in One Line&lt;/li&gt;
&lt;li&gt;Part 1: Explicit Config &amp;amp; Domain Knowledge&lt;/li&gt;
&lt;li&gt;Part 2: LLM Boost — When String Matching Isn't Enough&lt;/li&gt;
&lt;li&gt;Key Takeaways&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Dataset
&lt;/h2&gt;

&lt;p&gt;The CMS Hospital General Information file is a public dataset from &lt;a href="https://data.cms.gov/provider-data/dataset/xubh-q36u" rel="noopener noreferrer"&gt;data.cms.gov&lt;/a&gt; listing every Medicare-certified hospital in the United States. We downloaded the April 2026 snapshot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hospitals.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# (5426, 38)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;5,426 rows. 38 columns. The key fields: &lt;code&gt;facility_name&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;, &lt;code&gt;citytown&lt;/code&gt;, &lt;code&gt;state&lt;/code&gt;, &lt;code&gt;zip_code&lt;/code&gt;, &lt;code&gt;telephone_number&lt;/code&gt;, &lt;code&gt;hospital_type&lt;/code&gt;, &lt;code&gt;hospital_ownership&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here's a sample of what the raw data looks like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;facility_name&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;citytown&lt;/th&gt;
&lt;th&gt;state&lt;/th&gt;
&lt;th&gt;telephone_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MEMORIAL HOSPITAL&lt;/td&gt;
&lt;td&gt;3801 SPRING AVE&lt;/td&gt;
&lt;td&gt;DECATUR&lt;/td&gt;
&lt;td&gt;IL&lt;/td&gt;
&lt;td&gt;(217) 876-8121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEMORIAL HOSPITAL&lt;/td&gt;
&lt;td&gt;4500 MEMORIAL DR&lt;/td&gt;
&lt;td&gt;BELLEVILLE&lt;/td&gt;
&lt;td&gt;IL&lt;/td&gt;
&lt;td&gt;(618) 233-7750&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEMORIAL HOSPITAL&lt;/td&gt;
&lt;td&gt;116 EAST 12TH STREET&lt;/td&gt;
&lt;td&gt;JASPER&lt;/td&gt;
&lt;td&gt;IN&lt;/td&gt;
&lt;td&gt;(812) 996-2345&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ST LUKES MEDICAL CENTER&lt;/td&gt;
&lt;td&gt;1800 E VAN BUREN ST&lt;/td&gt;
&lt;td&gt;PHOENIX&lt;/td&gt;
&lt;td&gt;AZ&lt;/td&gt;
&lt;td&gt;(602) 251-8100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLORIDA STATE HOSPITAL UNIT 14 PSYCH&lt;/td&gt;
&lt;td&gt;PO BOX 1000&lt;/td&gt;
&lt;td&gt;CHATTAHOOCHEE&lt;/td&gt;
&lt;td&gt;FL&lt;/td&gt;
&lt;td&gt;(850) 663-7536&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLORIDA STATE HOSPITAL UNIT 31 MED&lt;/td&gt;
&lt;td&gt;PO BOX 1000&lt;/td&gt;
&lt;td&gt;CHATTAHOOCHEE&lt;/td&gt;
&lt;td&gt;FL&lt;/td&gt;
&lt;td&gt;(850) 663-7536&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Phone numbers use &lt;code&gt;(xxx) xxx-xxxx&lt;/code&gt; formatting. Some addresses abbreviate "STREET" as "ST" while others spell it out. The same hospital name appears across multiple states. And in a few cases, the same physical facility shows up as two rows with different unit designations.&lt;/p&gt;
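&lt;p&gt;Those inconsistencies are exactly what a cleaning pass normalises before matching is attempted. A hand-rolled sketch of both fixes; GoldenFlow infers rules like these automatically, and the suffix table here is deliberately tiny:&lt;/p&gt;

```python
import re

# Illustrative street-suffix expansions; a real table is much larger.
SUFFIXES = {"ST": "STREET", "AVE": "AVENUE", "DR": "DRIVE", "BLVD": "BOULEVARD"}

def normalize_phone(raw: str) -> str:
    """Reduce '(217) 876-8121' or '217.876.8121' to bare digits."""
    return re.sub(r"\D", "", raw)

def normalize_address(raw: str) -> str:
    """Upper-case, collapse whitespace, and expand common street suffixes."""
    tokens = raw.upper().split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

print(normalize_phone("(217) 876-8121"))       # 2178768121
print(normalize_address("3801 Spring  Ave"))   # 3801 SPRING AVENUE
```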
&lt;h2&gt;
  
  
  Why &lt;code&gt;drop_duplicates()&lt;/code&gt; Fails on Real Data
&lt;/h2&gt;

&lt;p&gt;The instinct is to reach for &lt;a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html" rel="noopener noreferrer"&gt;pandas &lt;code&gt;drop_duplicates()&lt;/code&gt;&lt;/a&gt;. Let's try it three ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1: All columns.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hospitals.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dupes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;duplicated&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dupes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Zero exact duplicates. Every row differs on at least one column — different phone format, different whitespace, different unit number. Real-world data almost never has perfect row-level copies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2: Facility name only.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dupes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;duplicated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dupes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 131
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;131 rows flagged. But this is wrong in the other direction — 87 hospital names appear more than once because they're genuinely different hospitals in different states. "MEMORIAL HOSPITAL" in Decatur, IL is not the same facility as "MEMORIAL HOSPITAL" in Jasper, IN. Deduplicating on name alone merges records that should stay separate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 3: Manual fuzzy matching.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fuzzywuzzy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fuzz&lt;/span&gt;

&lt;span class="c1"&gt;# Compare every pair? 5,426 * 5,425 / 2 = 14.7 million comparisons
# Even at 10,000 comparisons/sec, that's 24 minutes
# And you still need to decide: what threshold? which columns? how to merge?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You could write a custom fuzzy matcher — lowercase everything, strip whitespace, compute Levenshtein ratios. But you'd need to handle blocking (which records to compare), scoring (how to weight name vs address vs phone), and merging (how to pick the canonical record). That's hundreds of lines of brittle code for one dataset.&lt;/p&gt;
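&lt;p&gt;Blocking is what makes the pair count tractable: only compare rows that share a cheap key. A sketch using (state, first token of the facility name) as the block key; real blocking keys need more care than this:&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records: list[dict]) -> list[tuple[int, int]]:
    """Group rows by (state, first name token); only pair rows within a block."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        key = (rec["state"], rec["facility_name"].split()[0])
        blocks[key].append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

rows = [
    {"facility_name": "MEMORIAL HOSPITAL", "state": "IL"},
    {"facility_name": "MEMORIAL HOSPITAL", "state": "IL"},
    {"facility_name": "MEMORIAL HOSPITAL", "state": "IN"},
    {"facility_name": "ST LUKES MEDICAL CENTER", "state": "AZ"},
]
print(candidate_pairs(rows))  # [(0, 1)] — only the two IL Memorials are compared
```

Against the full file, keys like this collapse 14.7 million possible pairs into a few thousand within-block comparisons, which is why the custom-matcher route still demands blocking, scoring, and merge logic on top.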

&lt;p&gt;The core problem: naive approaches either miss real duplicates or merge records that shouldn't be merged. You need profiling, cleaning, and matching as a coordinated pipeline.&lt;/p&gt;
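&lt;p&gt;To make the "hundreds of lines" claim concrete, here is a compressed sketch of the three jobs a hand-rolled matcher has to do: block, score, and link. It uses only the standard library (&lt;code&gt;difflib&lt;/code&gt; standing in for fuzzywuzzy) and hypothetical sample records:&lt;/p&gt;

```python
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "facility_name": "CARTHAGE AREA HOSPITAL", "state": "NY"},
    {"id": 2, "facility_name": "CARTHAGE AREA HOSPITAL ", "state": "NY"},
    {"id": 3, "facility_name": "MEMORIAL HOSPITAL", "state": "IL"},
    {"id": 4, "facility_name": "MEMORIAL HOSPITAL", "state": "IN"},
]

def normalize(s: str) -> str:
    # Collapse whitespace and casing before comparing.
    return " ".join(s.upper().split())

def score(a: dict, b: dict) -> float:
    return SequenceMatcher(None, normalize(a["facility_name"]),
                           normalize(b["facility_name"])).ratio()

# Blocking: group records by state, then only score pairs within a block.
blocks = {}
for rec in records:
    blocks.setdefault(rec["state"], []).append(rec)

matches = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
    if score(a, b) >= 0.9
]
print(matches)  # [(1, 2)]
```

&lt;p&gt;Even this toy version already forces the three decisions the paragraph above names: the blocking key, the similarity threshold, and what to do with the matched pairs afterward.&lt;/p&gt;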
&lt;h2&gt;
  
  
  Zero-Config Data Cleaning in One Line
&lt;/h2&gt;

&lt;p&gt;GoldenPipe runs the full scan-clean-deduplicate pipeline in a single call. If you're new to GoldenPipe, the &lt;a href="https://bensevern.dev/blog/2026-03-30-getting-started-goldenpipe-python-backend" rel="noopener noreferrer"&gt;getting started guide&lt;/a&gt; covers installation and core concepts.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hospitals.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# "completed"
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# {total: 3.1, check: 0.4, flow: 0.4, match: 2.0}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or from the command line:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;goldenpipe run hospitals.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The interactive playground sample covers 5,000 rows and the 11 key columns; the numbers below were generated from the full 38-column dataset.&lt;/p&gt;


&lt;p&gt;3.1 seconds total. That one call ran scan, clean, and deduplicate across all 5,426 rows. Let's look at each stage.&lt;/p&gt;
&lt;h2&gt;
  
  
  Stage 1: GoldenCheck — Scan
&lt;/h2&gt;

&lt;p&gt;GoldenCheck profiled all 38 columns and reported 155 quality findings in 0.4 seconds.&lt;/p&gt;

&lt;p&gt;Breakdown of the 155 findings:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What It Caught&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pattern_consistency&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;Phone formats, address abbreviation patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nullability&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;Columns with significant missing values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cardinality&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;Low-cardinality columns like &lt;code&gt;hospital_type&lt;/code&gt; (8 values)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;range_distribution&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Numeric outliers in zip codes and CMS ratings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;type_inference&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Phone/zip stored as strings but parseable as other types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;drift_detection&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Distribution shifts across data segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;null_correlation&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Columns that are null together (correlated missingness)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;format_detection&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Mixed formatting within single columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;uniqueness&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Near-unique columns like &lt;code&gt;facility_id&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern_consistency findings are the most actionable. GoldenCheck detected that all 5,426 phone numbers follow &lt;code&gt;(xxx) xxx-xxxx&lt;/code&gt; formatting — consistent but not normalized. It flagged 82 addresses with mixed abbreviation patterns ("STREET" vs "ST", "AVENUE" vs "AVE") and 52 facility names with inconsistent casing or whitespace.&lt;/p&gt;

&lt;p&gt;GoldenCheck doesn't fix anything — it hands findings to GoldenFlow.&lt;/p&gt;
&lt;h2&gt;
  
  
  Stage 2: GoldenFlow — Clean
&lt;/h2&gt;

&lt;p&gt;GoldenFlow read GoldenCheck's 155 findings and applied targeted transforms. 5,832 cells changed in 0.4 seconds.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Column&lt;/th&gt;
&lt;th&gt;Cells Changed&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;telephone_number&lt;/td&gt;
&lt;td&gt;5,426&lt;/td&gt;
&lt;td&gt;(217) 876-8121&lt;/td&gt;
&lt;td&gt;+12178768121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;address&lt;/td&gt;
&lt;td&gt;82&lt;/td&gt;
&lt;td&gt;116 EAST 12TH STREET&lt;/td&gt;
&lt;td&gt;116 E 12TH ST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;facility_name&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;ST LUKES MEDICAL CENTER (trailing space)&lt;/td&gt;
&lt;td&gt;ST LUKES MEDICAL CENTER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hospital_ownership&lt;/td&gt;
&lt;td&gt;271&lt;/td&gt;
&lt;td&gt;Government - Federal&lt;/td&gt;
&lt;td&gt;GOVERNMENT - FEDERAL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Phone normalization:&lt;/strong&gt; Every phone number converted from &lt;code&gt;(xxx) xxx-xxxx&lt;/code&gt; to E.164 (&lt;code&gt;+1xxxxxxxxxx&lt;/code&gt;). This isn't cosmetic — E.164 is the standard for downstream matching, API calls, and database storage.&lt;/p&gt;
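&lt;p&gt;The transform itself is simple for US numbers: strip everything but digits and prefix the country code. A minimal sketch, assuming well-formed 10-digit inputs (production code should reach for a library like &lt;code&gt;phonenumbers&lt;/code&gt;):&lt;/p&gt;

```python
import re

def to_e164(phone: str, country_code: str = "1") -> str:
    digits = re.sub(r"\D", "", phone)
    # Drop a leading country code if the source already included it.
    if len(digits) == 11 and digits.startswith(country_code):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"unexpected phone format: {phone!r}")
    return f"+{country_code}{digits}"

print(to_e164("(217) 876-8121"))  # +12178768121
```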

&lt;p&gt;&lt;strong&gt;Address standardization:&lt;/strong&gt; 82 addresses had inconsistent abbreviations. GoldenFlow normalized "STREET" to "ST", "AVENUE" to "AVE", "BOULEVARD" to "BLVD" — the USPS standard forms.&lt;/p&gt;
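&lt;p&gt;That normalization can be sketched as a token-level replacement table. This uses a small hand-picked subset of the USPS Publication 28 suffix and directional abbreviations, not an exhaustive list:&lt;/p&gt;

```python
# Subset of USPS standard suffix and directional abbreviations.
USPS_ABBREV = {
    "STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD",
    "DRIVE": "DR", "ROAD": "RD",
    "EAST": "E", "WEST": "W", "NORTH": "N", "SOUTH": "S",
}

def standardize_address(addr: str) -> str:
    tokens = addr.upper().split()
    return " ".join(USPS_ABBREV.get(t, t) for t in tokens)

print(standardize_address("116 EAST 12TH STREET"))  # 116 E 12TH ST
```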

&lt;p&gt;&lt;strong&gt;Name cleanup:&lt;/strong&gt; 52 facility names had trailing whitespace or double spaces. Invisible to the eye, fatal to exact matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership normalization:&lt;/strong&gt; 271 ownership values standardized to consistent casing. Small change, but it prevents false cardinality inflation downstream.&lt;/p&gt;

&lt;p&gt;Zero config. GoldenFlow used GoldenCheck's findings to decide which transforms were safe to apply automatically.&lt;/p&gt;
&lt;h2&gt;
  
  
  Stage 3: GoldenMatch — Deduplicate (Zero-Config)
&lt;/h2&gt;

&lt;p&gt;GoldenMatch ran entity resolution on the cleaned data. Here are the numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input records&lt;/td&gt;
&lt;td&gt;5,426&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Golden records (cluster representatives)&lt;/td&gt;
&lt;td&gt;479&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Records flagged as duplicates&lt;/td&gt;
&lt;td&gt;1,917&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unique (no matches)&lt;/td&gt;
&lt;td&gt;3,509&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total distinct entities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,988&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing time&lt;/td&gt;
&lt;td&gt;2.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;479 clusters. 1,917 records flagged as duplicates. GoldenMatch's internal record count (5,905) differs from the input (5,426) because GoldenFlow's transforms can expand rows when splitting multi-value fields. The match rate is computed against the internal count.&lt;/p&gt;
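&lt;p&gt;Cluster formation from pairwise matches is typically done with union-find: every matched pair is unioned, and each connected component becomes one cluster with one golden record. A generic sketch of that step (not GoldenMatch's internals), using hypothetical record ids:&lt;/p&gt;

```python
def find(parent: list, x: int) -> int:
    # Path halving: point each visited node closer to its root.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent: list, a: int, b: int) -> None:
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

# Matched pairs produced by the scoring stage.
pairs = [(0, 1), (1, 2), (3, 4)]
n = 6
parent = list(range(n))
for a, b in pairs:
    union(parent, a, b)

clusters = {}
for rec in range(n):
    clusters.setdefault(find(parent, rec), []).append(rec)
print(sorted(clusters.values()))  # [[0, 1, 2], [3, 4], [5]]
```

&lt;p&gt;Note how record 0 and record 2 end up in one cluster even though they were never directly compared; transitivity via record 1 links them.&lt;/p&gt;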
&lt;h3&gt;
  
  
  What the clusters look like
&lt;/h3&gt;

&lt;p&gt;Here are a few example clusters GoldenMatch produced:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;facility_name&lt;/th&gt;
&lt;th&gt;state&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;MEMORIAL HOSPITAL&lt;/td&gt;
&lt;td&gt;IL, IN, PA, GA, CO, TX, ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;COMMUNITY HOSPITAL&lt;/td&gt;
&lt;td&gt;OH, MO, IN, OK, ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;ST MARY'S HOSPITAL&lt;/td&gt;
&lt;td&gt;MO, WI, MI, NJ, NY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;REGIONAL MEDICAL CENTER&lt;/td&gt;
&lt;td&gt;AL, MS, TN, SC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Why zero-config over-matched
&lt;/h3&gt;

&lt;p&gt;479 clusters is too many for this dataset. The auto-config built blocking keys on facility name — the most obvious matching column. But hospital names are not unique identifiers. "MEMORIAL HOSPITAL" appears 12 times across different states. They are genuinely different hospitals.&lt;/p&gt;

&lt;p&gt;Without geographic anchoring, GoldenMatch grouped every "MEMORIAL HOSPITAL" into one cluster, every "COMMUNITY HOSPITAL" into another. The auto-config had no way to know that hospitals with the same name in different states are different entities. It did exactly what it was designed to do — match records with similar names — but the domain requires geographic context.&lt;/p&gt;

&lt;p&gt;This is the honest trade-off of zero-config: it's fast and catches obvious patterns, but it can over-match when names are common and geography matters. For hospital data specifically, you need to tell the matcher to only compare records within the same state.&lt;/p&gt;
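&lt;p&gt;The payoff of state-level blocking is easy to quantify: all-pairs comparison costs n(n-1)/2, while blocking costs that quantity summed per block. A back-of-envelope sketch, assuming for illustration that the 5,426 hospitals were spread evenly over 56 states and territories:&lt;/p&gt;

```python
def n_pairs(n: int) -> int:
    return n * (n - 1) // 2

total = n_pairs(5426)                 # all-pairs: ~14.7M comparisons
per_state = n_pairs(5426 // 56) * 56  # even split across 56 state blocks

print(total, per_state, round(total / per_state))
```

&lt;p&gt;Roughly a 50-60x reduction in candidate pairs before any scoring happens, on top of the false positives it prevents.&lt;/p&gt;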

&lt;p&gt;&lt;strong&gt;Ground-truth caveat:&lt;/strong&gt; The CMS dataset has no duplicate labels. These numbers measure how many records GoldenMatch grouped, not verified precision. The 479 clusters include both genuine duplicates and false positives from cross-state name matching. For production use, review borderline pairs with the review queue or &lt;code&gt;goldenmatch evaluate&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 1: Explicit Config — Encoding Domain Knowledge
&lt;/h2&gt;

&lt;p&gt;Zero-config over-matched because it lacked geographic context. Let's fix that with an explicit config that encodes what we know about hospital data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Blocking — Same State Only
&lt;/h3&gt;

&lt;p&gt;The most important change. Instead of comparing all hospitals with similar names, restrict comparisons to hospitals in the same state.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MatchKeyConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;          &lt;span class="c1"&gt;# Pass 1: same state + zip
&lt;/span&gt;            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;    &lt;span class="c1"&gt;# Pass 2: same state + first 3 chars of name
&lt;/span&gt;        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Pass 1 catches hospitals at the same zip code — the tightest geographic net. Pass 2 catches hospitals in the same state with similar names — wider but still geographically anchored. This means "MEMORIAL HOSPITAL" in IL will never be compared to "MEMORIAL HOSPITAL" in IN.&lt;/p&gt;
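&lt;p&gt;Mechanically, multi-pass blocking unions the candidate pairs from each pass. A sketch of how the two passes above generate candidates, using hypothetical records (&lt;code&gt;facility_name_3&lt;/code&gt; is the first three characters of the name):&lt;/p&gt;

```python
from itertools import combinations

records = [
    {"id": 1, "facility_name": "MEMORIAL HOSPITAL", "state": "IL", "zip_code": "62526"},
    {"id": 2, "facility_name": "MEMORIAL HOSPITAL", "state": "IN", "zip_code": "47546"},
    {"id": 3, "facility_name": "MEMORIAL HOSP",     "state": "IL", "zip_code": "62526"},
]

passes = [
    lambda r: (r["state"], r["zip_code"]),           # pass 1: state + zip
    lambda r: (r["state"], r["facility_name"][:3]),  # pass 2: state + name prefix
]

candidates = set()
for key_fn in passes:
    blocks = {}
    for rec in records:
        blocks.setdefault(key_fn(rec), []).append(rec["id"])
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))

print(sorted(candidates))  # [(1, 3)]; the IL vs IN pair is never generated
```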
&lt;h3&gt;
  
  
  Step 2: Scoring — Weighted Ensemble
&lt;/h3&gt;

&lt;p&gt;Hospital names carry the most signal, but address and phone provide confirmation.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;matchkeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ensemble&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telephone_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why these weights?&lt;/strong&gt; Facility name gets 2.0 because it's the primary identifier. Address gets 1.5 with &lt;code&gt;token_sort&lt;/code&gt; because word order varies ("1800 E VAN BUREN ST" vs "1800 EAST VAN BUREN STREET"). Phone gets 0.5 as a confirmation signal — same phone strongly suggests same facility, but different phones don't rule it out (multi-line hospitals). Zip gets 0.3 as a tiebreaker.&lt;/p&gt;
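&lt;p&gt;Mechanically, a weighted ensemble reduces to a weighted average of per-field similarities compared against the threshold. A simplified sketch using the weights above, with &lt;code&gt;difflib&lt;/code&gt; standing in for the actual ensemble scorers and hypothetical record values:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

WEIGHTS = {"facility_name": 2.0, "address": 1.5, "telephone_number": 0.5, "zip_code": 0.3}
FUZZY_COLS = ("facility_name", "address")

def pair_score(rec_a: dict, rec_b: dict) -> float:
    weighted = sum(
        w * (sim(rec_a[col], rec_b[col]) if col in FUZZY_COLS
             else float(rec_a[col] == rec_b[col]))  # exact match for phone/zip
        for col, w in WEIGHTS.items()
    )
    return weighted / sum(WEIGHTS.values())

a = {"facility_name": "CARTHAGE AREA HOSPITAL", "address": "1001 WEST ST",
     "telephone_number": "+13154931000", "zip_code": "13619"}
b = {"facility_name": "CARTHAGE AREA HOSPITAL", "address": "1001 WEST STREET",
     "telephone_number": "+13154931000", "zip_code": "13619"}

print(pair_score(a, b) >= 0.80)  # True
```

&lt;p&gt;Note how the heavily weighted exact name match carries the pair over the threshold even though the address similarity is dragged down by the abbreviation difference.&lt;/p&gt;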

&lt;p&gt;&lt;strong&gt;Why 0.80 threshold?&lt;/strong&gt; Hospital abbreviations ("ST" vs "SAINT", "MED CTR" vs "MEDICAL CENTER") drag fuzzy scores down. A threshold of 0.80 catches these while filtering noise.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 3: Run It
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hospitals.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# {total: 3.0, check: 0.4, flow: 0.4, match: 2.2}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input records&lt;/td&gt;
&lt;td&gt;5,426&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clusters found&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Records flagged as duplicates&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unique (no matches)&lt;/td&gt;
&lt;td&gt;5,414&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total distinct entities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,420&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processing time&lt;/td&gt;
&lt;td&gt;2.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;6 clusters. Down from 479. The state-based blocking eliminated all the cross-state false positives.&lt;/p&gt;
&lt;h3&gt;
  
  
  The 6 Genuine Clusters
&lt;/h3&gt;

&lt;p&gt;Every cluster is a real same-state match:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cluster&lt;/th&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;What Matched&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Crenshaw Community Hospital&lt;/td&gt;
&lt;td&gt;AL&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same facility, minor address variation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wiregrass Medical Center&lt;/td&gt;
&lt;td&gt;AL&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same facility, data entry differences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bullock County Hospital&lt;/td&gt;
&lt;td&gt;AL&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same facility, different record versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Florida State Hospital (Unit 14 Psych / Unit 31 Med)&lt;/td&gt;
&lt;td&gt;FL&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same campus, different unit designations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Progressive Health Group of Houston&lt;/td&gt;
&lt;td&gt;MS&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same facility, record variants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carthage Area Hospital ("WEST STREET" vs "WEST ST")&lt;/td&gt;
&lt;td&gt;NY&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same facility, address abbreviation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Florida State Hospital cluster is particularly interesting — Unit 14 (Psych) and Unit 31 (Med) are different departments at the same physical campus with the same phone number and PO Box address. Whether these should be merged depends on your use case. For a facility-level analysis, yes. For a department-level analysis, no.&lt;/p&gt;

&lt;p&gt;The Carthage Area Hospital cluster shows exactly the kind of match that &lt;code&gt;drop_duplicates()&lt;/code&gt; misses: "WEST STREET" vs "WEST ST" — same address, different abbreviation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Part 2: LLM Boost — When String Matching Isn't Enough
&lt;/h2&gt;

&lt;p&gt;String matching measures surface similarity. LLMs understand meaning. A hospital rebrand from "County General" to "Mercy Health Partners" has zero string overlap, but an LLM can reason about the context. For the theory and mechanics of LLM-assisted deduplication, see the &lt;a href="https://bensevern.dev/blog/2026-03-31-ai-powered-deduplication-llm-boost" rel="noopener noreferrer"&gt;LLM boost deep dive&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the config with LLM scoring enabled:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMScorerConfig&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;matchkeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facility_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ensemble&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;telephone_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;MatchKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LLMScorerConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;candidate_lo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;candidate_hi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;calibration_sample_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hospitals.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The LLM scorer examines pairs that fall in the "uncertainty zone" — between 0.65 (too low to match) and 0.80 (already matched by fuzzy scoring). These are the borderline cases where string similarity alone can't decide.&lt;/p&gt;
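&lt;p&gt;Selecting which pairs the LLM sees is just a band filter on the fuzzy score. A sketch of the routing logic implied by &lt;code&gt;candidate_lo&lt;/code&gt;/&lt;code&gt;candidate_hi&lt;/code&gt;, with made-up pair scores:&lt;/p&gt;

```python
scored_pairs = [
    ((1, 2), 0.95),  # clear match: accepted by fuzzy scoring alone
    ((3, 4), 0.72),  # uncertainty zone: escalate to the LLM
    ((5, 6), 0.40),  # clear non-match: dropped
]

CANDIDATE_LO, CANDIDATE_HI = 0.65, 0.80

auto_match = [p for p, s in scored_pairs if s >= CANDIDATE_HI]
llm_queue  = [p for p, s in scored_pairs if CANDIDATE_LO <= s < CANDIDATE_HI]
rejected   = [p for p, s in scored_pairs if s < CANDIDATE_LO]

print(auto_match, llm_queue, rejected)
# [(1, 2)] [(3, 4)] [(5, 6)]
```

&lt;p&gt;On this dataset the middle band was empty, which is why the LLM had nothing to do.&lt;/p&gt;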
&lt;h3&gt;
  
  
  Results: 0 Additional Pairs
&lt;/h3&gt;

&lt;p&gt;The LLM scored zero additional pairs. Not because it failed — because there were no candidates in the uncertainty zone. Every pair was either above 0.80 (already matched) or below 0.65 (clearly not a match).&lt;/p&gt;

&lt;p&gt;This is the honest story. For well-structured data with strong geographic blocking, explicit config is already so precise that the LLM has nothing to evaluate. The blocking passes constrain comparisons to same-state records, and within a state, hospital names either match clearly or don't match at all. There's no ambiguous middle ground.&lt;/p&gt;
&lt;h3&gt;
  
  
  When LLM Boost Does Help
&lt;/h3&gt;

&lt;p&gt;LLM scoring shines on datasets where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Names have semantic variation:&lt;/strong&gt; "County General Hospital" vs "Mercy Health Partners" (rebrand)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking is looser:&lt;/strong&gt; Blocking on city alone produces more candidate pairs in the uncertainty zone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abbreviation patterns are inconsistent:&lt;/strong&gt; Some records use "MED CTR" while others use "MEDICAL CENTER" — fuzzy scores land around 0.70-0.78&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual data:&lt;/strong&gt; "Hospital Municipal" vs "City Hospital" — zero string overlap, same entity&lt;/li&gt;
&lt;/ul&gt;
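&lt;p&gt;The abbreviation case is easy to reproduce with plain character similarity, used here as a rough stand-in for GoldenMatch's fuzzy scorer (the hospital names are made up):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def sim(a, b):
    # Character-level similarity, a rough stand-in for a fuzzy scorer.
    return SequenceMatcher(None, a, b).ratio()

exact = sim("MERCY MEDICAL CENTER", "MERCY MEDICAL CENTER")
abbrev = sim("MERCY MED CTR", "MERCY MEDICAL CENTER")
unrelated = sim("MERCY MEDICAL CENTER", "LAKESIDE BEHAVIORAL HEALTH")

# Abbreviated variants score clearly below exact matches yet well above
# unrelated names, which is precisely the middle band where an LLM
# tiebreak can earn its keep.
print(round(exact, 2), round(abbrev, 2), round(unrelated, 2))
```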

&lt;p&gt;On the CMS hospital data with state-based blocking, the explicit config already catches everything the LLM would. The $0.50 budget went unspent.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Full Picture
&lt;/h2&gt;

&lt;p&gt;Three approaches on the same 5,426 records:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Zero-Config&lt;/th&gt;
&lt;th&gt;Explicit Config&lt;/th&gt;
&lt;th&gt;Explicit + LLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clusters found&lt;/td&gt;
&lt;td&gt;479&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Records merged&lt;/td&gt;
&lt;td&gt;1,917&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distinct entities&lt;/td&gt;
&lt;td&gt;3,988&lt;/td&gt;
&lt;td&gt;5,420&lt;/td&gt;
&lt;td&gt;5,420&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;3.1s&lt;/td&gt;
&lt;td&gt;3.0s&lt;/td&gt;
&lt;td&gt;3.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config effort&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;~20 lines&lt;/td&gt;
&lt;td&gt;~30 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Ground-truth caveat:&lt;/strong&gt; None of these numbers are verified precision — the CMS data has no duplicate labels. The comparison shows relative improvement across approaches. The 479 zero-config clusters are demonstrably inflated (cross-state matching of common names), while the 6 explicit-config clusters pass manual inspection. For production use, verify matches with the review queue or &lt;code&gt;goldenmatch evaluate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The progression tells the real story:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config&lt;/strong&gt; ran the full pipeline in 3.1 seconds with no effort. It caught real patterns (phone normalization, address standardization) but over-matched on deduplication because hospital names repeat across states.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit config&lt;/strong&gt; added 20 lines of domain knowledge — state-based blocking and weighted scoring — and dropped false positives from 479 clusters to 6. Same speed. Dramatically better results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM boost&lt;/strong&gt; found nothing additional on this dataset, which is the correct outcome. The explicit config was already precise enough. On messier data with semantic name variation, the LLM earns its keep.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;drop_duplicates()&lt;/code&gt; barely scratches the surface.&lt;/strong&gt; Zero exact duplicates in 5,426 real hospital records. The duplicates are there — they just don't look identical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A coordinated pipeline beats three separate scripts.&lt;/strong&gt; GoldenCheck's findings feed GoldenFlow's transforms, which feed GoldenMatch's scoring. Each stage builds on the last.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config gets you started in one line.&lt;/strong&gt; 155 findings, 5,832 cells cleaned, deduplication complete — all in 3.1 seconds. Good enough for exploration and prototyping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config can over-match.&lt;/strong&gt; When names are common and geography matters, auto-blocking without domain context produces false positives. Always inspect the clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit config encodes domain knowledge.&lt;/strong&gt; 20 lines of config — state-based blocking + weighted scoring — reduced false positives by 98%. The data tells you what the config should be.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM boost is for the long tail, not every dataset.&lt;/strong&gt; Well-structured data with strong blocking may not need it. Save it for messy, semantic, or multilingual matching problems.&lt;/li&gt;
&lt;/ul&gt;
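&lt;p&gt;The first takeaway is easy to verify: exact-equality matching, which is all &lt;code&gt;drop_duplicates()&lt;/code&gt; does, treats near-duplicates as distinct rows. A sketch with made-up records:&lt;/p&gt;

```python
# Exact-equality dedup, the drop_duplicates() model: a row is a duplicate
# only if every field is byte-identical. The records are hypothetical.
records = [
    ("ST MARYS HOSPITAL", "100 MAIN ST", "TX"),
    ("St. Mary's Hospital", "100 Main Street", "TX"),  # same facility
    ("MERCY MED CTR", "45 OAK AVE", "CA"),
]

exact_dupes = len(records) - len(set(records))
print(exact_dupes)  # 0: exact matching sees three distinct rows
```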
&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bensevern.dev/playground/" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;Try the GoldenPipe Playground&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On your machine:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;goldenpipe
goldenpipe run hospitals.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explore the source:&lt;/strong&gt; &lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/benzsevern" rel="noopener noreferrer"&gt;
        benzsevern
      &lt;/a&gt; / &lt;a href="https://github.com/benzsevern/goldenpipe" rel="noopener noreferrer"&gt;
        goldenpipe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Golden Suite orchestrator — chains validation (GoldenCheck), transformation (GoldenFlow), and entity resolution (GoldenMatch). 4 MCP tools on Smithery. DQBench Pipeline: 88.07.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;GoldenPipe&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Golden Suite orchestrator&lt;/strong&gt; -- Check quality, fix issues, deduplicate records. One command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/goldenpipe/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/0cb7b5b0a984b57bc72c1edd0cd6d383e2d284a7fe88182aafabe2676937a0a8/68747470733a2f2f696d672e736869656c64732e696f2f707970692f762f676f6c64656e706970653f636f6c6f723d643461303137" alt="PyPI"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/goldenpipe/actions/workflows/test.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/benzsevern/goldenpipe/actions/workflows/test.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a href="https://codecov.io/gh/benzsevern/goldenpipe" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/17775a94b6f1c64037691f745033e90f33c7e76815029d4e10a17f15abbc3667/68747470733a2f2f636f6465636f762e696f2f67682f62656e7a73657665726e2f676f6c64656e706970652f67726170682f62616467652e737667" alt="codecov"&gt;&lt;/a&gt;
&lt;a href="https://pepy.tech/project/goldenpipe" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/30c9f59bfe27c4a47e8f8b51d8227d36a88fa2f0fec2b2f8c42b67c9e2082ede/68747470733a2f2f7374617469632e706570792e746563682f62616467652f676f6c64656e706970652f6d6f6e7468" alt="Downloads"&gt;&lt;/a&gt;
&lt;a href="https://python.org" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/96abf9b704f80578ea56dd10cab0d911c56d46dbec347f431ece9cf60ac175ad/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31312532422d626c7565" alt="Python 3.11+"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/goldenpipe/LICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/f8df3091bbe1149f398a5369b2c39e896766f9f6efba3477c63e9b4aa940ef14/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d677265656e" alt="License: MIT"&gt;&lt;/a&gt;
&lt;a href="https://benzsevern.github.io/goldenpipe/" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/ff32e0646a54feab435d71acccfb46f1e12d8d24c0155cffe104660b4456254f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f63732d62656e7a73657665726e2e6769746875622e696f253246676f6c64656e706970652d643461303137" alt="Docs"&gt;&lt;/a&gt;
&lt;a href="https://github.com/benzsevern/dqbench" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/011bf64a16a90d1721c4660b65eea6eecdd4aba38daa49b85105804e70b50eb7/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f445142656e6368253230506970656c696e652d38382e30372d676f6c64" alt="DQBench Pipeline"&gt;&lt;/a&gt;
&lt;a href="https://colab.research.google.com/github/benzsevern/goldenpipe/blob/main/scripts/goldenpipe_demo.ipynb" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/eff96fda6b2e0fff8cdf2978f89d61aa434bb98c00453ae23dd0aab8d1451633/68747470733a2f2f636f6c61622e72657365617263682e676f6f676c652e636f6d2f6173736574732f636f6c61622d62616467652e737667" alt="Open In Colab"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What It Does&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;Raw Data
  | GoldenCheck   -- profile &amp;amp; discover quality issues
  | GoldenFlow    -- fix issues, standardize, reshape
  | GoldenMatch   -- deduplicate, match, create golden records
  v
Golden Records
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;GoldenPipe orchestrates the full pipeline with adaptive logic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Skips&lt;/strong&gt; transformation if no quality issues found&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routes&lt;/strong&gt; to privacy-preserving matching if sensitive fields detected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reports&lt;/strong&gt; reasoning for every decision&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Install&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;pip install goldenpipe&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Quick Start&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-python notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;&lt;span class="pl-k"&gt;import&lt;/span&gt; &lt;span class="pl-s1"&gt;goldenpipe&lt;/span&gt; &lt;span class="pl-k"&gt;as&lt;/span&gt; &lt;span class="pl-s1"&gt;gp&lt;/span&gt;

&lt;span class="pl-s1"&gt;result&lt;/span&gt; &lt;span class="pl-c1"&gt;=&lt;/span&gt; &lt;span class="pl-s1"&gt;gp&lt;/span&gt;.&lt;span class="pl-c1"&gt;run&lt;/span&gt;(&lt;span class="pl-s"&gt;"customers.csv"&lt;/span&gt;)

&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-c1"&gt;status&lt;/span&gt;)        &lt;span class="pl-c"&gt;# "success"&lt;/span&gt;
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-c1"&gt;check&lt;/span&gt;)         &lt;span class="pl-c"&gt;# Quality findings&lt;/span&gt;
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-c1"&gt;transform&lt;/span&gt;)     &lt;span class="pl-c"&gt;# What was fixed&lt;/span&gt;
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-c1"&gt;match&lt;/span&gt;)         &lt;span class="pl-c"&gt;# Deduplicated clusters&lt;/span&gt;
&lt;span class="pl-en"&gt;print&lt;/span&gt;(&lt;span class="pl-s1"&gt;result&lt;/span&gt;.&lt;span class="pl-c1"&gt;reasoning&lt;/span&gt;)     &lt;span class="pl-c"&gt;# Why each decision was made&lt;/span&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;CLI&lt;/h2&gt;

&lt;/div&gt;
&lt;div class="highlight highlight-source-shell notranslate position-relative overflow-auto js-code-highlight"&gt;
&lt;pre&gt;goldenpipe run customers.csv                &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt; Full pipeline&lt;/span&gt;
goldenpipe run customers.csv --verbose      &lt;span class="pl-c"&gt;&lt;span class="pl-c"&gt;#&lt;/span&gt;&lt;/span&gt;&lt;/pre&gt;…
&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/benzsevern/goldenpipe" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-06-dirty-csv-to-golden-records" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>401K messy equipment records, LLM-calibrated scoring, 12 seconds. Here's how.</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:49:55 +0000</pubDate>
      <link>https://dev.to/benzsevern/401k-messy-equipment-records-llm-calibrated-scoring-12-seconds-heres-how-4f89</link>
      <guid>https://dev.to/benzsevern/401k-messy-equipment-records-llm-calibrated-scoring-12-seconds-heres-how-4f89</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/benzsevern/deduplicating-401000-equipment-auction-records-with-llm-calibration-1knn" class="crayons-story__hidden-navigation-link"&gt;Deduplicating 401,000 Equipment Auction Records with LLM Calibration&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/benzsevern" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" alt="benzsevern profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/benzsevern" class="crayons-story__secondary fw-medium m:hidden"&gt;
              benzsevern
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                benzsevern
                
              
              &lt;div id="story-author-preview-content-3454451" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/benzsevern" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;benzsevern&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/benzsevern/deduplicating-401000-equipment-auction-records-with-llm-calibration-1knn" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 4&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/benzsevern/deduplicating-401000-equipment-auction-records-with-llm-calibration-1knn" id="article-link-3454451"&gt;
          Deduplicating 401,000 Equipment Auction Records with LLM Calibration
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/benzsevern/deduplicating-401000-equipment-auction-records-with-llm-calibration-1knn#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            6 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>The same 10 data issues show up in every dataset. Here are the one-liner fixes.</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:49:27 +0000</pubDate>
      <link>https://dev.to/benzsevern/the-same-10-data-issues-show-up-in-every-dataset-here-are-the-one-liner-fixes-k3a</link>
      <guid>https://dev.to/benzsevern/the-same-10-data-issues-show-up-in-every-dataset-here-are-the-one-liner-fixes-k3a</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/benzsevern/10-data-problems-every-pipeline-hits-and-the-one-liner-fixes-2391" class="crayons-story__hidden-navigation-link"&gt;10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/benzsevern" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" alt="benzsevern profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/benzsevern" class="crayons-story__secondary fw-medium m:hidden"&gt;
              benzsevern
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                benzsevern
                
              
              &lt;div id="story-author-preview-content-3454426" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/benzsevern" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;benzsevern&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/benzsevern/10-data-problems-every-pipeline-hits-and-the-one-liner-fixes-2391" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 4&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/benzsevern/10-data-problems-every-pipeline-hits-and-the-one-liner-fixes-2391" id="article-link-3454426"&gt;
          10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/tutorial"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;tutorial&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/benzsevern/10-data-problems-every-pipeline-hits-and-the-one-liner-fixes-2391#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            4 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Benchmarked 4 Python dedup libraries on the same dataset. Results surprised me.</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:48:56 +0000</pubDate>
      <link>https://dev.to/benzsevern/benchmarked-4-python-dedup-libraries-on-the-same-dataset-results-surprised-me-e3f</link>
      <guid>https://dev.to/benzsevern/benchmarked-4-python-dedup-libraries-on-the-same-dataset-results-surprised-me-e3f</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf" class="crayons-story__hidden-navigation-link"&gt;GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/benzsevern" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" alt="benzsevern profile" class="crayons-avatar__image" width="800" height="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/benzsevern" class="crayons-story__secondary fw-medium m:hidden"&gt;
              benzsevern
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                benzsevern
                
              
              &lt;div id="story-author-preview-content-3454457" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/benzsevern" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3837729%2Fca3cd3ca-dea6-498e-bb37-41406d247cbd.JPEG" class="crayons-avatar__image" alt="" width="800" height="800"&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;benzsevern&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Apr 4&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf" id="article-link-3454457"&gt;
          GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/python"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;python&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/opensource"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;opensource&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/dataengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;dataengineering&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
            &lt;a href="https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            8 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:11:08 +0000</pubDate>
      <link>https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf</link>
      <guid>https://dev.to/benzsevern/goldenmatch-vs-splink-vs-dedupe-vs-recordlinkage-a-practical-comparison-1akf</guid>
      <description>&lt;p&gt;There are four serious Python libraries for entity resolution. They make fundamentally different bets — about how much you should configure, how training should work, what scale means, and how much the library should do for you. We ran all four on the same three datasets to find out where those bets pay off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on fairness:&lt;/strong&gt; GoldenMatch is ours. We tried to be even-handed — same datasets, same evaluation code, same machine, best reasonable config for each library. Every script is published in our &lt;a href="https://github.com/bsevern/golden-showcase/tree/main/comparison_bench" rel="noopener noreferrer"&gt;comparison benchmark repo&lt;/a&gt;. If we got something wrong, open a PR.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contenders
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GoldenMatch&lt;/strong&gt; is a configuration-driven deduplication engine. You define blocking rules and weighted match keys; it handles scoring, clustering, and optional LLM calibration. No training data needed, but you &lt;em&gt;do&lt;/em&gt; need to write explicit config — auto-config failed on all three datasets in this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splink&lt;/strong&gt; is a probabilistic record linkage library built on the Fellegi-Sunter model. It uses DuckDB (or Spark/Athena) as a SQL backend for scale, estimates match weights via expectation-maximisation, and produces calibrated match probabilities. The most statistically rigorous option.&lt;/p&gt;
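&lt;p&gt;The Fellegi-Sunter idea fits in a few lines. This is a sketch of the weight calculation only; the &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt; values below are hypothetical, whereas Splink estimates them from the data via EM:&lt;/p&gt;

```python
from math import log2

def match_weight(m, u, agrees):
    # Fellegi-Sunter field weight: log2(m / u) when the field agrees,
    # log2((1 - m) / (1 - u)) when it disagrees, where m is P(agree
    # given a true match) and u is P(agree given a non-match).
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

# Surname agrees, city disagrees; the evidence is summed across fields.
# The m and u values are hypothetical, not Splink estimates.
total = match_weight(0.9, 0.01, True) + match_weight(0.8, 0.3, False)
print(round(total, 2))  # positive total: the pair leans toward a match
```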

&lt;p&gt;&lt;strong&gt;Dedupe&lt;/strong&gt; is the oldest of the four. It uses active learning — you label pairs interactively, it trains a classifier, then partitions your data. Powerful in theory, but the interactive labeling requirement makes automation harder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecordLinkage&lt;/strong&gt; provides a clean, scikit-learn-style API for building linkage pipelines: indexer, comparator, classifier. Straightforward and well-documented, but the project hasn't been updated since July 2023.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Training Data&lt;/th&gt;
&lt;th&gt;Scale Strategy&lt;/th&gt;
&lt;th&gt;Last Release&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GoldenMatch&lt;/td&gt;
&lt;td&gt;Config-driven weighted scoring&lt;/td&gt;
&lt;td&gt;None required&lt;/td&gt;
&lt;td&gt;In-memory + ANN blocking&lt;/td&gt;
&lt;td&gt;Active (2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splink&lt;/td&gt;
&lt;td&gt;Fellegi-Sunter EM&lt;/td&gt;
&lt;td&gt;Unsupervised (EM)&lt;/td&gt;
&lt;td&gt;SQL backends (DuckDB/Spark)&lt;/td&gt;
&lt;td&gt;Active (2026)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedupe&lt;/td&gt;
&lt;td&gt;Active learning classifier&lt;/td&gt;
&lt;td&gt;Interactive labeling&lt;/td&gt;
&lt;td&gt;Disk-backed&lt;/td&gt;
&lt;td&gt;Active (2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RecordLinkage&lt;/td&gt;
&lt;td&gt;Indexer + Compare + Classify&lt;/td&gt;
&lt;td&gt;Optional (unsupervised default)&lt;/td&gt;
&lt;td&gt;In-memory&lt;/td&gt;
&lt;td&gt;Unmaintained (2023)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Datasets
&lt;/h2&gt;

&lt;p&gt;We chose three datasets that test different things:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;True Matches&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;What It Tests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Febrl&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;6,538 pairs&lt;/td&gt;
&lt;td&gt;Synthetic personal records&lt;/td&gt;
&lt;td&gt;PII matching: names, dates, addresses, postcodes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DBLP-ACM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,910&lt;/td&gt;
&lt;td&gt;2,224&lt;/td&gt;
&lt;td&gt;Bibliographic records&lt;/td&gt;
&lt;td&gt;Non-PII matching: paper titles, authors, venues, years&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NC Voter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10,000 sample&lt;/td&gt;
&lt;td&gt;None (no ground truth)&lt;/td&gt;
&lt;td&gt;Real voter registration&lt;/td&gt;
&lt;td&gt;Scale and robustness on messy real-world data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Febrl is the easy warm-up — synthetic PII with controlled noise. DBLP-ACM is harder: paper titles require semantic understanding, author lists vary in format, and venue names are inconsistent. NC Voter is the real-world stress test.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results at a Glance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy — Febrl (5,000 synthetic personal records)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Splink&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.995&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.998&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GoldenMatch&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.943&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.971&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedupe&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.865&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.928&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RecordLinkage&lt;/td&gt;
&lt;td&gt;0.999&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.845&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Accuracy — DBLP-ACM (4,910 bibliographic records, 2,224 true matches)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RecordLinkage&lt;/td&gt;
&lt;td&gt;0.888&lt;/td&gt;
&lt;td&gt;0.961&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.923&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;13.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GoldenMatch&lt;/td&gt;
&lt;td&gt;0.891&lt;/td&gt;
&lt;td&gt;0.945&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.918&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedupe&lt;/td&gt;
&lt;td&gt;0.604&lt;/td&gt;
&lt;td&gt;0.936&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.734&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splink&lt;/td&gt;
&lt;td&gt;0.646&lt;/td&gt;
&lt;td&gt;0.834&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.728&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Scale — NC Voter (10K sample, no ground truth)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Library&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Multi-record&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Splink&lt;/td&gt;
&lt;td&gt;6.9s&lt;/td&gt;
&lt;td&gt;9,996&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;10.0 MB&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GoldenMatch&lt;/td&gt;
&lt;td&gt;8.0s&lt;/td&gt;
&lt;td&gt;918&lt;/td&gt;
&lt;td&gt;918&lt;/td&gt;
&lt;td&gt;55.7 MB&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RecordLinkage&lt;/td&gt;
&lt;td&gt;22.7s&lt;/td&gt;
&lt;td&gt;1,462&lt;/td&gt;
&lt;td&gt;1,462&lt;/td&gt;
&lt;td&gt;101.3 MB&lt;/td&gt;
&lt;td&gt;Completed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dedupe&lt;/td&gt;
&lt;td&gt;268s&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Failed (disk space exhaustion)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A note on reading the cluster counts: with 10,000 records, Splink's 9,996 clusters means it counts every record as a cluster (9,992 singletons plus 4 merged groups), while the other libraries report only multi-record clusters — so the Clusters column isn't directly comparable across rows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Accuracy Deep-Dive
&lt;/h2&gt;

&lt;p&gt;The headline finding: &lt;strong&gt;no library wins everywhere&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;On Febrl, Splink dominates. Its Fellegi-Sunter model is purpose-built for PII — names, dates, addresses are exactly the field types where EM weight estimation shines. An F1 of 0.998 on 5,000 records is near-perfect. GoldenMatch's 0.971 is strong but behind, mostly due to lower recall (0.943 vs. 0.995). Splink's probabilistic approach catches more fuzzy matches that fall below GoldenMatch's weighted threshold.&lt;/p&gt;

&lt;p&gt;On DBLP-ACM, the rankings &lt;em&gt;flip&lt;/em&gt;. Splink drops to 0.728 F1 — its EM training struggles when the data doesn't fit clean PII patterns. Paper titles, author lists, and venue abbreviations don't decompose into the kind of comparison levels that Fellegi-Sunter expects. RecordLinkage takes the top spot at 0.923, just ahead of GoldenMatch at 0.918. RecordLinkage's KMeans classifier finds a clean decision boundary in the feature space without needing field-specific statistical models.&lt;/p&gt;

&lt;p&gt;GoldenMatch is the most &lt;em&gt;consistent&lt;/em&gt; performer: second on Febrl (0.971), second on DBLP-ACM (0.918). It doesn't win either dataset outright, but it never drops below 0.91. That consistency matters if you're working across data types and don't want to switch libraries per project.&lt;/p&gt;

&lt;p&gt;Dedupe's DBLP-ACM precision (0.604) is concerning — it's matching a lot of records that aren't actually duplicates. Its recall is fine (0.936), but the classifier trained on pre-labeled pairs seems to have learned an overly generous boundary.&lt;/p&gt;
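&lt;p&gt;For context on what these numbers mean: the benchmark scores &lt;em&gt;pairs&lt;/em&gt;, not clusters. A minimal sketch of pairwise precision/recall/F1 with made-up pair sets (our illustration, not the benchmark's evaluation code):&lt;/p&gt;

```python
# Pairwise evaluation: compare the set of predicted duplicate pairs
# against ground-truth pairs. Record IDs here are hypothetical.

def pairwise_metrics(predicted: set, truth: set) -> tuple:
    """Compute precision, recall, F1 over unordered record-id pairs."""
    tp = len(predicted & truth)  # pairs found and actually duplicates
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# frozenset makes (a, b) == (b, a), matching how match pairs are scored
truth = {frozenset(p) for p in [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]}
predicted = {frozenset(p) for p in [("r1", "r2"), ("r3", "r4"), ("r7", "r8")]}

precision, recall, f1 = pairwise_metrics(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```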

&lt;h2&gt;
  
  
  Setup Effort
&lt;/h2&gt;

&lt;p&gt;Raw line counts are similar across the Febrl scripts (81–109 lines including shared boilerplate). But the &lt;em&gt;nature&lt;/em&gt; of the configuration differs meaningfully. Here's the library-specific core for each:&lt;/p&gt;

&lt;h3&gt;
  
  
  GoldenMatch (~30 lines of config)
&lt;/h3&gt;

&lt;p&gt;You define blocking passes and weighted match fields. No training step — scores are deterministic from config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch.config.schemas&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MatchkeyConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GoldenMatchConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;blocking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;BlockingConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi_pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;passes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soundex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soundex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[]),&lt;/span&gt;
            &lt;span class="nc"&gt;BlockingKeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[]),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;max_block_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_oversized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;matchkeys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;MatchkeyConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weighted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaro_winkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jaro_winkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transforms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="nc"&gt;MatchkeyField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scorer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;goldenmatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you need to know: blocking field selection, which scorer fits which field type, weight tuning. The config is verbose but declarative — no hidden state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Honest caveat:&lt;/strong&gt; GoldenMatch's auto-config (&lt;code&gt;dedupe_df(df)&lt;/code&gt; with no config) failed on &lt;em&gt;all three datasets&lt;/em&gt;. On Febrl it misclassified fields; on DBLP-ACM it couldn't infer blocking rules for bibliographic data; on NC Voter it produced poor results. Explicit config was required every time. This is the single biggest usability gap we found.&lt;/p&gt;
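&lt;p&gt;To make the weighted-scoring idea concrete: each field similarity is scaled by its weight, and the combined score is compared against the threshold. A rough stdlib sketch of that arithmetic (our own illustration, &lt;em&gt;not&lt;/em&gt; GoldenMatch internals — &lt;code&gt;difflib&lt;/code&gt; stands in for the real &lt;code&gt;jaro_winkler&lt;/code&gt;/&lt;code&gt;token_sort&lt;/code&gt; scorers):&lt;/p&gt;

```python
# Illustrative weighted match-key scoring. difflib's SequenceMatcher
# is a stand-in similarity function, not what GoldenMatch uses.
from difflib import SequenceMatcher

def field_sim(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercase/strip transforms."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities in [0, 1]."""
    total = sum(weights.values())
    score = sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in weights.items())
    return score / total

weights = {"given_name": 2.0, "surname": 2.0, "postcode": 0.5}
a = {"given_name": "Jon", "surname": "Smith", "postcode": "2000"}
b = {"given_name": "John", "surname": "Smith", "postcode": "2000"}

print(weighted_score(a, b, weights) >= 0.7)  # above threshold: a match
```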

&lt;h3&gt;
  
  
  Splink (~40 lines of config + training)
&lt;/h3&gt;

&lt;p&gt;You define comparison levels, blocking rules, then run EM training to estimate match weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;splink&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Linker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SettingsCreator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DuckDBAPI&lt;/span&gt;

&lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SettingsCreator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;link_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dedupe_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unique_id_column_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rec_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comparisons&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JaroWinklerAtThresholds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JaroWinklerAtThresholds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LevenshteinAtThresholds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExactMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;soc_sec_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LevenshteinAtThresholds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;cl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ExactMatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;blocking_rules_to_generate_predictions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;linker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Linker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DuckDBAPI&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;linker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_u_using_random_sampling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_pairs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;linker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_parameters_using_expectation_maximisation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fix_u_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;linker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate_parameters_using_expectation_maximisation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fix_u_probabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;preds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inference&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold_match_probability&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;linker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clustering&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cluster_pairwise_predictions_at_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold_match_probability&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you need to know: comparison levels (thresholds per field), which blocking rules to use for EM training (they must be &lt;em&gt;different&lt;/em&gt; from prediction blocking), and the two-phase EM estimation pattern. The config surface area is larger than GoldenMatch's, but you get calibrated probabilities in return.&lt;/p&gt;
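&lt;p&gt;The EM step exists to estimate the model's &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;u&lt;/code&gt; probabilities. A toy sketch of the underlying Fellegi-Sunter arithmetic — each field comparison contributes &lt;code&gt;log2(m/u)&lt;/code&gt; of evidence — with made-up parameter values, not Splink's estimated ones:&lt;/p&gt;

```python
# Miniature Fellegi-Sunter scoring: m = P(fields agree | records match),
# u = P(fields agree | records don't match). Values below are invented
# for illustration; Splink estimates them from the data via EM.
from math import log2

field_params = {
    #            m      u
    "surname":  (0.95, 0.01),
    "dob":      (0.90, 0.003),
    "postcode": (0.85, 0.05),
}

def match_weight(agreements: dict) -> float:
    """Sum log2 Bayes-factor evidence across field comparisons."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = field_params[field]
        # agreement is evidence for a match; disagreement against it
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

# surname and dob agree, postcode disagrees
print(round(match_weight({"surname": True, "dob": True, "postcode": False}), 2))
```

A high positive total means strong evidence for a match; the threshold on match probability in the Splink code above is the calibrated version of this cutoff.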

&lt;h3&gt;
  
  
  RecordLinkage (~25 lines of config)
&lt;/h3&gt;

&lt;p&gt;The cleanest API of the four. Indexer, compare, classify — three steps.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;recordlinkage.classifiers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KMeansClassifier&lt;/span&gt;

&lt;span class="n"&gt;indexer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recordlinkage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;indexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname_block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# first 3 chars of surname
&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indexer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;compare&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;recordlinkage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Compare&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jarowinkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jarowinkler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;levenshtein&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;compare&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KMeansClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you need to know: indexer selection (blocking, sorted neighbourhood, full index), comparison methods, classifier choice. The API is intuitive if you've used scikit-learn. The downside: a single blocking pass misses any match that falls outside that block, which explains the lower Febrl recall (0.733).&lt;/p&gt;
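&lt;p&gt;A common mitigation is multi-pass blocking: generate candidate pairs under several blocking keys and take the union, so a typo in one key doesn't drop the pair entirely. A library-free sketch of the idea (the field names mirror the Febrl setup above; recordlinkage itself can stack multiple indexing passes on one &lt;code&gt;Index&lt;/code&gt; object):&lt;/p&gt;

```python
def block_pairs(records, key_fn):
    """All within-block candidate pairs (i, j) with i < j for one blocking key."""
    buckets = {}
    for i, rec in enumerate(records):
        buckets.setdefault(key_fn(rec), []).append(i)
    return {(a, b) for ids in buckets.values()
            for a in ids for b in ids if a < b}

records = [
    {"surname": "smith", "postcode": "2000"},
    {"surname": "smyth", "postcode": "2000"},   # typo breaks the surname block
    {"surname": "smith", "postcode": "2010"},   # move breaks the postcode block
]
by_name = block_pairs(records, lambda r: r["surname"][:3])
by_post = block_pairs(records, lambda r: r["postcode"])
candidates = by_name | by_post  # union recovers both true pairs
```

&lt;p&gt;The surname pass alone finds only (0, 2); the postcode pass alone finds only (0, 1); the union keeps both.&lt;/p&gt;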

&lt;h3&gt;
  
  
  Dedupe (~60 lines of config + data conversion)
&lt;/h3&gt;

&lt;p&gt;The most involved setup. You define variables, convert your DataFrame to Dedupe's dict format, provide training pairs, train, then partition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;variables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;given_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ShortString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ShortString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Convert DataFrame to dict format (Dedupe requirement)
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rec_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deduper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedupe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dedupe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deduper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prepare_training&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# pre-labeled pairs
&lt;/span&gt;&lt;span class="n"&gt;deduper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deduper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you need to know: Dedupe &lt;em&gt;requires&lt;/em&gt; labeled training pairs. By default it launches an interactive console session where you label pairs one at a time. For automation, you need to pre-generate a training JSON file (which is what we did). The DataFrame-to-dict conversion is also a friction point — every other library accepts DataFrames directly.&lt;/p&gt;
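&lt;p&gt;For reference, the training file we pre-generated is JSON with &lt;code&gt;match&lt;/code&gt; and &lt;code&gt;distinct&lt;/code&gt; lists of record pairs. A rough sketch of that shape (field names are from our Febrl setup; the exact serialization Dedupe expects is documented in its own docs, so treat this as illustrative):&lt;/p&gt;

```python
import json

# Hypothetical pre-labeled pairs in roughly the shape Dedupe's training
# file uses: {"match": [...], "distinct": [...]} of record-dict pairs.
rec_a = {"given_name": "jane", "surname": "smith", "address_1": "12 oak st",
         "postcode": "2000", "date_of_birth": "19701104"}
rec_b = {"given_name": "jayne", "surname": "smith", "address_1": "12 oak st",
         "postcode": "2000", "date_of_birth": "19701104"}
rec_c = {"given_name": "john", "surname": "brown", "address_1": "9 elm rd",
         "postcode": "3121", "date_of_birth": "19850522"}

training = {"match": [[rec_a, rec_b]], "distinct": [[rec_a, rec_c]]}

with open("training.json", "w") as f:
    json.dump(training, f)
```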

&lt;h2&gt;
  
  
  Scale
&lt;/h2&gt;

&lt;p&gt;The NC Voter dataset is 10,000 real voter registration records (sampled from 208K — full-scale test pending). No ground truth, so we can't measure accuracy, but we can measure speed, memory, and whether the library survives at all.&lt;/p&gt;

&lt;p&gt;Splink is the fastest at 6.9s and the most memory-efficient at 10.0 MB — its DuckDB backend handles blocking and comparison in SQL, keeping the Python memory footprint minimal. It found only 4 multi-record clusters though, which is surprisingly conservative for voter data with common names and addresses.&lt;/p&gt;

&lt;p&gt;GoldenMatch completed in 8.0s with 918 clusters. Higher memory usage (55.7 MB) since it works in-memory, but reasonable for 10K records.&lt;/p&gt;

&lt;p&gt;RecordLinkage completed but took 22.7s and used 101.3 MB. The in-memory pair comparison doesn't scale as efficiently as SQL-backed approaches.&lt;/p&gt;

&lt;p&gt;Dedupe failed after 268 seconds with a disk space exhaustion error. Its disk-backed approach generates intermediate files during training and partition — on a 10K dataset, that shouldn't be a problem, but it was. This is a significant reliability concern for production use.&lt;/p&gt;

&lt;p&gt;Note: this was a 10K sample. At 208K records, the performance gaps would widen substantially. We expect Splink's SQL backend to handle it well; GoldenMatch should manage with ANN blocking; RecordLinkage and Dedupe would likely struggle.&lt;/p&gt;
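&lt;p&gt;The timing and memory numbers above came from a simple wrapper around each library's entry point. A stdlib sketch of such a harness (note &lt;code&gt;tracemalloc&lt;/code&gt; only sees Python-level allocations, which is why a DuckDB-backed library can report a tiny Python footprint):&lt;/p&gt;

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn and return (result, wall seconds, peak Python-heap MB)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1e6

# usage with any library call, e.g.:
#   clusters, secs, mb = profile(deduper.partition, data, threshold=0.5)
out, secs, mb = profile(sorted, range(100_000), reverse=True)
```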

&lt;h2&gt;
  
  
  When to Pick What
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Pick GoldenMatch if&lt;/strong&gt; you want consistent accuracy across data types without training data. It placed top-2 on both Febrl (F1=0.971) and DBLP-ACM (F1=0.918) — the only library that stayed competitive across PII and non-PII domains. The optional &lt;a href="https://dev.to/blog/ai-powered-deduplication-llm-boost"&gt;LLM calibration&lt;/a&gt; can push accuracy further in production. But know that you &lt;em&gt;will&lt;/em&gt; need to write explicit config — auto-config is not ready for real workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Splink if&lt;/strong&gt; your data is PII-heavy — names, dates, addresses, identifiers. On that kind of data, its Fellegi-Sunter model is hard to beat (Febrl F1=0.998). The DuckDB/Spark backends give you a real path to millions of records. Config is verbose but well-documented. Just be aware it may underperform on non-standard domains (DBLP-ACM F1=0.728).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick Dedupe if&lt;/strong&gt; you have labeled training data and want the active learning workflow. In theory, human-in-the-loop labeling should produce the best classifier for your specific domain. In practice, the interactive labeling requirement makes automation painful, it was the slowest library on every dataset, and it failed outright on NC Voter. Best suited for one-off dedup projects where you can sit and label pairs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick RecordLinkage if&lt;/strong&gt; you want the simplest API and your data is structured. It surprised us on DBLP-ACM (F1=0.923, best of the four) and the three-step pipeline is easy to reason about. The concern: the project has been unmaintained since July 2023. No new releases, no bug fixes, no security patches. Fine for experiments and internal tools — risky for production dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Didn't Test
&lt;/h2&gt;

&lt;p&gt;These results are "best reasonable config" — we spent a few hours tuning each library, not days. An expert in any one of these libraries could likely improve its numbers. We also didn't test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM-boosted GoldenMatch (which would likely improve recall on both datasets)&lt;/li&gt;
&lt;li&gt;Splink with Spark backend at full NC Voter scale (208K)&lt;/li&gt;
&lt;li&gt;Dedupe with extensive interactive labeling (we used pre-generated pairs)&lt;/li&gt;
&lt;li&gt;Multi-pass blocking for RecordLinkage (which would improve its Febrl recall)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;

&lt;p&gt;Try GoldenMatch on your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;goldenmatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the &lt;a href="https://dev.to/try"&gt;interactive playground&lt;/a&gt; to test configurations without writing code.&lt;/p&gt;

&lt;p&gt;For more GoldenMatch benchmarks, see our &lt;a href="https://dev.to/blog/goldenmatch-bpid-benchmark"&gt;BPID benchmark post&lt;/a&gt; (adversarial PII matching) and the &lt;a href="https://dev.to/blog/equipment-dedup-bulldozer-401k"&gt;equipment deduplication case study&lt;/a&gt; (401K real auction records).&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-03-goldenmatch-vs-splink-dedupe-recordlinkage" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark</title>
      <dc:creator>benzsevern</dc:creator>
      <pubDate>Sat, 04 Apr 2026 17:10:52 +0000</pubDate>
      <link>https://dev.to/benzsevern/goldenmatch-vs-bpid-testing-against-an-emnlp-benchmark-ndl</link>
      <guid>https://dev.to/benzsevern/goldenmatch-vs-bpid-testing-against-an-emnlp-benchmark-ndl</guid>
      <description>&lt;p&gt;How well does your deduplication tool handle profiles that are &lt;em&gt;designed&lt;/em&gt; to fool it?&lt;/p&gt;

&lt;p&gt;Amazon published &lt;a href="https://aclanthology.org/2024.emnlp-industry.40/" rel="noopener noreferrer"&gt;BPID&lt;/a&gt; (Benchmark for Personal Identity Deduplication) at EMNLP 2024 — the first open-source benchmark specifically for PII matching. It includes 10,000 profile pairs where even GPT-4 and fine-tuned BERT models struggle to tell matches from non-matches.&lt;/p&gt;

&lt;p&gt;We ran GoldenMatch against it. No training data, no fine-tuning. Just string similarity primitives, date parsing, and Vertex AI embeddings.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes BPID Hard
&lt;/h2&gt;

&lt;p&gt;Most entity resolution benchmarks (DBLP-ACM, Abt-Buy, Febrl) test whether your system can find similar records. BPID tests whether it can &lt;em&gt;not&lt;/em&gt; match records that look similar but aren't.&lt;/p&gt;

&lt;p&gt;Each profile has five attributes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fullname&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Free text&lt;/td&gt;
&lt;td&gt;Nicknames (Bill/William), gender variants (Daniel/Danielle), reordering (Smith John -&amp;gt; John Smith)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;email&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List of addresses&lt;/td&gt;
&lt;td&gt;Shared domains, similar usernames across different people&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List of numbers&lt;/td&gt;
&lt;td&gt;Country code variations, partial numbers, formatting noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;addr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List of addresses&lt;/td&gt;
&lt;td&gt;Same street different state, semantic variations (100th vs one hundredth)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dob&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Free text date&lt;/td&gt;
&lt;td&gt;Format variations (1990-11-14 vs 14 nov 1990), partial dates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The dataset has &lt;strong&gt;4,333 match pairs&lt;/strong&gt; and &lt;strong&gt;5,667 no-match pairs&lt;/strong&gt;. The no-match pairs are intentionally adversarial — two different people named "Damien Skinner" and "Skinner Damien" sharing an email address and phone number, but with contradicting birthdates. A naive string similarity approach will confidently match them.&lt;/p&gt;

&lt;p&gt;On top of that, ~18% of attribute values are missing. Some profiles have a single-letter name and no email. You get a &lt;code&gt;fullname&lt;/code&gt; of "b" paired with "marshal jennifer bivens" — and they're labeled as a match.&lt;/p&gt;
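&lt;p&gt;To see concretely why naive similarity fails on the reordered-name traps: token-sort normalization maps both orderings onto the identical string, so a token-sort scorer returns a perfect score. A minimal stdlib illustration (real scorers such as rapidfuzz's &lt;code&gt;token_sort_ratio&lt;/code&gt; apply the same normalization):&lt;/p&gt;

```python
def token_sort(name):
    """Lowercase and sort tokens -- the normalization token-sort scorers apply."""
    return " ".join(sorted(name.lower().split()))

a, b = "Damien Skinner", "Skinner Damien"
same = token_sort(a) == token_sort(b)  # identical after normalization
```

&lt;p&gt;Both names normalize to the same string, which is exactly how two different people end up with a confident "match."&lt;/p&gt;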

&lt;h2&gt;
  
  
  The Published Baselines
&lt;/h2&gt;

&lt;p&gt;The BPID paper benchmarked several methods:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest&lt;/td&gt;
&lt;td&gt;Traditional (hand-crafted features)&lt;/td&gt;
&lt;td&gt;0.653&lt;/td&gt;
&lt;td&gt;0.609&lt;/td&gt;
&lt;td&gt;0.629&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ditto&lt;/td&gt;
&lt;td&gt;Pre-trained language model&lt;/td&gt;
&lt;td&gt;0.746&lt;/td&gt;
&lt;td&gt;0.804&lt;/td&gt;
&lt;td&gt;0.752&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sudowoodo&lt;/td&gt;
&lt;td&gt;Pre-trained language model (SOTA)&lt;/td&gt;
&lt;td&gt;0.774&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.788&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Random Forest uses hand-engineered string similarity features. Ditto and Sudowoodo are BERT-based models fine-tuned on labeled pairs. Even Claude 3 Sonnet and GPT-4 Turbo were tested — LLMs scored well but still made systematic errors on phone number digit comparison (tokenization struggles with exact digit counts).&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Approach
&lt;/h2&gt;

&lt;p&gt;GoldenMatch wasn't designed for BPID's pair classification format. It's a deduplication engine — you feed it a table of records and it finds clusters. So we adapted its scoring primitives for pairwise comparison and iterated through three configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config 1: Naive Weighted Scoring (0.665 F1)
&lt;/h3&gt;

&lt;p&gt;Our first pass used GoldenMatch's field-level primitives (&lt;code&gt;score_field&lt;/code&gt;, &lt;code&gt;apply_transforms&lt;/code&gt;) with a list-aware scorer. BPID profiles have multi-valued fields (lists of emails, phones, addresses), so we score each element pair and take the maximum.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rapidfuzz.distance&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JaroWinkler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rapidfuzz.fuzz&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;token_sort_ratio&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ensemble_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;GoldenMatch ensemble: max(jaro_winkler, token_sort, soundex*0.8)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;jw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JaroWinkler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;token_sort_ratio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;
    &lt;span class="n"&gt;sx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;jellyfish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;soundex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;jellyfish&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;soundex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sx&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For identifier fields (email, phone), we check for exact overlap first — one shared email between two profiles is a strong match signal per BPID's annotation rules. The final score is a weighted average across all available fields.&lt;/p&gt;

&lt;p&gt;This gave us &lt;strong&gt;0.665 F1&lt;/strong&gt; — above the Random Forest baseline (0.629), but the score distribution told us why it wasn't higher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Match pairs:    mean=0.828
No-match pairs: mean=0.715
Gap:            0.113
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only 0.11 separation. Some no-match pairs score a perfect 1.0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config 2: Optimized Classical Scoring (0.747 F1)
&lt;/h3&gt;

&lt;p&gt;The breakthrough was &lt;strong&gt;proper DOB parsing&lt;/strong&gt;. Our naive scorer compared raw digit strings — "14 nov 1953" and "1953-11-14" produce different digit sequences despite being the same date. We built a date parser that extracts (year, month, day) components from free-text dates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_dob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dob&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Parse free-text DOB into (year, month, day) components.

    Handles: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1953 11 09&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;09 nov 1953&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;19530911&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
             &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;nov 1953&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;09 2007&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;jul 18sat 1953&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract month names, then parse remaining numbers
&lt;/span&gt;    &lt;span class="c1"&gt;# Try YYYYMMDD, then positional disambiguation
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With parsed components, a contradicting birth &lt;strong&gt;year&lt;/strong&gt; is a near-certain no-match signal — different people share names and addresses, but rarely share a birthdate. We weighted year contradictions at 2.5x.&lt;/p&gt;
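&lt;p&gt;The parser body is elided above; a much-reduced sketch of the component-extraction idea (a toy, not the production parser — it skips positional disambiguation, two-digit years, and month-name edge cases):&lt;/p&gt;

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

def parse_dob(dob):
    """Toy extraction of (year, month, day); None for missing components."""
    s = dob.strip().lower()
    month = None
    for name, num in MONTHS.items():
        if name in s:                       # month written as a word
            month, s = num, s.replace(name, " ")
            break
    nums = [int(n) for n in re.findall(r"\d+", s)]
    year = next((n for n in nums if n > 31), None)
    small = [n for n in nums if n <= 31]
    if month is None and len(small) >= 2:   # all-numeric date: assume month first
        month, day = small[0], small[1]
    else:
        day = small[0] if small else None
    return (year, month, day)
```

&lt;p&gt;With this, "14 nov 1953" and "1953-11-14" land on the same triple, so a year contradiction becomes detectable rather than buried in digit-string noise.&lt;/p&gt;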

&lt;p&gt;We also improved phone normalization (strip country codes, compare last 10 digits) and name scoring (first-name extraction to detect gender swaps like Daniel/Danielle).&lt;/p&gt;
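&lt;p&gt;The phone normalization is small enough to show in full; a sketch of what "compare last 10 digits" means (assumes NANP-length national numbers):&lt;/p&gt;

```python
import re

def normalize_phone(raw):
    """Keep digits only, then the last 10 -- drops country codes and formatting."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:]

match = normalize_phone("+1 (919) 555-0100") == normalize_phone("919.555.0100")
```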

&lt;p&gt;The result: &lt;strong&gt;0.747 F1&lt;/strong&gt; — a +0.08 jump from DOB parsing alone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Match pairs:    mean=0.899
No-match pairs: mean=0.715
Gap:            0.184
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The score gap nearly doubled. Precision jumped from 0.541 to 0.655, eliminating ~1,200 false positives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Config 3: Classical + Vertex AI Embeddings (0.750 F1)
&lt;/h3&gt;

&lt;p&gt;We embedded all 20,000 profiles using Vertex AI's &lt;code&gt;text-embedding-004&lt;/code&gt; (768 dimensions) and computed cosine similarity for each pair. Embeddings alone scored &lt;strong&gt;0.658 F1&lt;/strong&gt; — worse than classical scoring because the embedding gap was only 0.062 (adversarial profiles are semantically similar by design).&lt;/p&gt;

&lt;p&gt;But blending 65% classical + 35% embedding produced &lt;strong&gt;0.750 F1&lt;/strong&gt; — a small but real improvement. The embedding captures semantic relationships that string matching misses (Bill/William, abbreviated addresses) while the classical scorer provides the structural discrimination (DOB parsing, exact identifier overlap).&lt;/p&gt;
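&lt;p&gt;The blend itself is just a weighted average of the two signals. A sketch with the weights quoted above and cosine similarity written out in stdlib Python (the embedding vectors here are placeholders, not real Vertex AI outputs):&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hybrid_score(classical, emb_a, emb_b, w_classical=0.65):
    """65% classical scorer, 35% embedding cosine, as in Config 3."""
    return w_classical * classical + (1 - w_classical) * cosine(emb_a, emb_b)
```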

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;F1&lt;/th&gt;
&lt;th&gt;Training Data&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Random Forest (BPID paper)&lt;/td&gt;
&lt;td&gt;0.653&lt;/td&gt;
&lt;td&gt;0.609&lt;/td&gt;
&lt;td&gt;0.629&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GoldenMatch classical&lt;/td&gt;
&lt;td&gt;0.655&lt;/td&gt;
&lt;td&gt;0.869&lt;/td&gt;
&lt;td&gt;0.747&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GoldenMatch + embeddings&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.672&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.849&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.750&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~8min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ditto (BPID paper)&lt;/td&gt;
&lt;td&gt;0.746&lt;/td&gt;
&lt;td&gt;0.804&lt;/td&gt;
&lt;td&gt;0.752&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sudowoodo (BPID paper)&lt;/td&gt;
&lt;td&gt;0.774&lt;/td&gt;
&lt;td&gt;0.802&lt;/td&gt;
&lt;td&gt;0.788&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GoldenMatch matches Ditto (0.750 vs 0.752) with &lt;strong&gt;zero training data&lt;/strong&gt;. The gap to Sudowoodo (0.788) remains — fine-tuned BERT models that learn PII-specific representations still have an edge on adversarial data.&lt;/p&gt;

&lt;p&gt;Note the precision-recall balance: GoldenMatch trades precision (0.655-0.672) for recall (0.849-0.869) relative to the PLMs. In production, this tradeoff is tunable via the threshold — at t=0.87, GoldenMatch hits 0.718 precision / 0.734 recall / 0.726 F1.&lt;/p&gt;
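&lt;p&gt;Tuning that threshold needs only the raw pair scores and labels. A stdlib sketch of the precision/recall/F1 sweep (toy scores, not the BPID run):&lt;/p&gt;

```python
def prf(scores, labels, threshold):
    """Precision, recall, F1 for pair classification at one threshold."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

scores = [0.95, 0.90, 0.80, 0.70, 0.60]
labels = [True, True, False, True, False]
best_t = max([0.65, 0.75, 0.85], key=lambda t: prf(scores, labels, t)[2])
```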

&lt;h2&gt;
  
  
  What Made the Difference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DOB parsing was the single biggest lever
&lt;/h3&gt;

&lt;p&gt;Going from raw digit comparison to parsed (year, month, day) components was worth &lt;strong&gt;+0.08 F1&lt;/strong&gt;. A birth year contradiction is the strongest no-match signal in PII data — stronger than different names (people change names) or different addresses (people move).&lt;/p&gt;

&lt;h3&gt;
  
  
  Embeddings help, but not as much as you'd think
&lt;/h3&gt;

&lt;p&gt;Vertex AI embeddings added only +0.003 F1 on top of the optimized classical scorer. The reason: BPID's adversarial pairs are &lt;em&gt;designed&lt;/em&gt; to be semantically similar. "Daniel" and "Danielle" are close in embedding space. The embedding helps most on genuine matches with unusual formatting, but can't reject the traps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-valued fields need max-over-pairs scoring
&lt;/h3&gt;

&lt;p&gt;BPID profiles have lists of emails, phones, and addresses. Concatenating them into a single string and running Jaro-Winkler produces poor results. Scoring each element pair and taking the maximum matches BPID's annotation rule: "one shared element = match for that attribute."&lt;/p&gt;
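&lt;p&gt;The max-over-pairs rule is a few lines; a sketch with an exact-match scorer standing in for the field-level scorer (email strings below are made up):&lt;/p&gt;

```python
def score_multivalued(values_a, values_b, scorer):
    """Best score over all cross pairs; an empty list scores 0 (treated as missing)."""
    if not values_a or not values_b:
        return 0.0
    return max(scorer(a, b) for a in values_a for b in values_b)

exact = lambda a, b: 1.0 if a == b else 0.0
emails_a = ["d.skinner@mail.com", "damien@work.io"]
emails_b = ["damien@work.io"]
```

&lt;p&gt;One shared element pushes the attribute score to its maximum, mirroring the annotation rule.&lt;/p&gt;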

&lt;h3&gt;
  
  
  First-name extraction catches gender swaps
&lt;/h3&gt;

&lt;p&gt;BPID includes deliberate negative name pairs: Daniel/Danielle, Jon/John, Mary/Mark. The ensemble scorer gives these high similarity (~0.85+). Extracting tokens and checking that at least one name token matches well across profiles catches many of these.&lt;/p&gt;
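&lt;p&gt;The token check can be sketched with stdlib &lt;code&gt;difflib&lt;/code&gt; (the 0.93 cutoff here is illustrative, not our tuned value): Daniel/Danielle score about 0.86 as whole tokens and fail the near-exact bar, while a reordered John Smith still passes on the identical tokens.&lt;/p&gt;

```python
import difflib

def shares_strong_token(name_a, name_b, min_ratio=0.93):
    """True if at least one token pair is a near-exact match across the names."""
    ratio = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
    return any(ratio(a, b) >= min_ratio
               for a in name_a.lower().split()
               for b in name_b.lower().split())
```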

&lt;h3&gt;
  
  
  LLM boost actually hurt performance
&lt;/h3&gt;

&lt;p&gt;We sent 4,747 borderline pairs (hybrid score 0.66-0.86) to GPT-4.1-mini. The result surprised us: F1 dropped from 0.750 to 0.737. The LLM achieved only 60.7% accuracy on borderline pairs — barely better than random. It said "yes" to 2,646 of 4,747 pairs, creating more false positives than it eliminated.&lt;/p&gt;

&lt;p&gt;Why? The same adversarial design that makes BPID hard for string matchers also tricks LLMs. Two profiles with the same name, similar emails, and overlapping phone numbers &lt;em&gt;look&lt;/em&gt; like a match to a language model — it can't reliably detect that the birthdates contradict or that the phone numbers differ by exactly the last four digits. The BPID paper observed the same pattern: even GPT-4 Turbo and Claude 3 Sonnet make systematic errors on digit comparison because tokenization obscures exact digit counts.&lt;/p&gt;

&lt;p&gt;The lesson: on adversarial PII data, structured feature engineering (parsing dates into components, normalizing phone numbers, checking first-name tokens) outperforms LLM reasoning. The LLM adds value on &lt;a href="https://dev.to/blog/ai-powered-deduplication-llm-boost"&gt;real-world data&lt;/a&gt; where the challenge is variety, not adversarial traps.&lt;/p&gt;
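&lt;p&gt;A sketch of that kind of feature engineering, using only the standard library (the format list and the helper names are assumptions, not GoldenMatch's API):&lt;/p&gt;

```python
import re
from datetime import datetime

def parse_dob(s: str):
    """Parse a birthdate into (year, month, day), trying common formats."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"):
        try:
            d = datetime.strptime(s.strip(), fmt)
            return (d.year, d.month, d.day)
        except ValueError:
            pass
    return None  # unparseable — fall back to string similarity

def normalize_phone(s: str) -> str:
    """Keep digits only; drop a leading country code for comparison."""
    digits = re.sub(r"\D", "", s)
    return digits[-10:] if len(digits) > 10 else digits

# Component-wise DOB comparison detects contradictions that character
# similarity misses: these two strings look alike but disagree on year.
print(parse_dob("1984-03-07") == parse_dob("03/07/1985"))  # False

# Normalized phones compare exactly despite formatting differences.
print(normalize_phone("+1 (555) 010-4477") == normalize_phone("555.010.4477"))  # True
```

&lt;p&gt;Exact comparison on parsed components is precisely what tokenized LLM input obscures, which is why this cheap preprocessing beats the model on digit-heavy fields.&lt;/p&gt;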

&lt;h2&gt;
  
  
  Running This Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;goldenmatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download BPID from &lt;a href="https://zenodo.org/records/13932202" rel="noopener noreferrer"&gt;Zenodo&lt;/a&gt; (Apache 2.0 license, 58MB).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch.core.scorer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;score_field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;goldenmatch.utils.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;apply_transforms&lt;/span&gt;

&lt;span class="c1"&gt;# Load BPID matching dataset
&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;matching_dataset.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Score a pair with GoldenMatch primitives
&lt;/span&gt;&lt;span class="n"&gt;name_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_transforms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;corrie arreola&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;name_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_transforms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arreola corrie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 1.0 — handles reordering
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full benchmark scripts (naive, optimized, embedding, LLM boost) are in the &lt;a href="https://github.com/benzsevern/golden-showcase" rel="noopener noreferrer"&gt;bpid_bench&lt;/a&gt; directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where GoldenMatch Fits
&lt;/h2&gt;

&lt;p&gt;BPID is a pair classification benchmark — given two profiles, decide match or no-match. GoldenMatch is built for a different task: given a table of N records, find all duplicate clusters. The pair scoring approach here uses GoldenMatch's primitives outside their normal pipeline context.&lt;/p&gt;

&lt;p&gt;For production PII deduplication, GoldenMatch's pipeline adds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blocking&lt;/strong&gt; — reduces O(N^2) comparisons to manageable candidate sets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; (Union-Find) — produces transitive groups, not just pairwise decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Golden rules&lt;/strong&gt; — merges clusters into canonical records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM calibration&lt;/strong&gt; — handles borderline pairs for ~$0.01&lt;/li&gt;
&lt;/ul&gt;
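&lt;p&gt;The clustering step is the standard Union-Find structure; a minimal sketch (not GoldenMatch's implementation) of how pairwise match decisions become transitive groups:&lt;/p&gt;

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairwise "match" decisions in, transitive clusters out:
uf = UnionFind()
for a, b in [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]:
    uf.union(a, b)

clusters = {}
for rec in ["r1", "r2", "r3", "r4", "r5"]:
    clusters.setdefault(uf.find(rec), []).append(rec)
print(sorted(clusters.values()))  # [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

&lt;p&gt;Note the transitivity: r1 and r3 were never directly compared, but land in the same cluster via r2 — something a pure pair classifier never produces.&lt;/p&gt;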

&lt;p&gt;On structured data with standard blocking keys, GoldenMatch hits &lt;a href="https://dev.to/blog/entity-resolution-real-data-nc-voters"&gt;97.2% F1 on DBLP-ACM&lt;/a&gt; and processes &lt;a href="https://dev.to/blog/equipment-dedup-bulldozer-401k"&gt;401K records in under 30 seconds&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;BPID tests a specific, adversarial corner: PII profiles with intentional near-miss traps. GoldenMatch matches Ditto's F1 without training data — and the classical scorer runs in 0.2 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GoldenMatch scores 0.750 F1 on BPID&lt;/strong&gt; — matching Ditto (0.752), above Random Forest (0.629)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero training data&lt;/strong&gt; — no labeled pairs, no fine-tuning, no GPU training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOB parsing was the biggest win&lt;/strong&gt; — proper date component extraction added +0.08 F1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings provide marginal gains&lt;/strong&gt; — Vertex AI embeddings added +0.003 F1 over classical scoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0.2 seconds for classical scoring&lt;/strong&gt; — 41,000+ pairs/sec on a laptop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The precision-recall tradeoff is tunable&lt;/strong&gt; — adjust threshold for your use case&lt;/li&gt;
&lt;/ul&gt;
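&lt;p&gt;For the last point, tuning the threshold is a straightforward sweep over precision and recall. A self-contained sketch with toy scores (the real benchmark scripts compute this over all BPID pairs):&lt;/p&gt;

```python
def prf(scores, labels, threshold):
    """Precision, recall, F1 for a given decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy pair scores and gold labels — sweep to pick an operating point.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
labels = [True, True, False, True, False, False]
for t in (0.5, 0.7, 0.9):
    p, r, f1 = prf(scores, labels, t)
    print(f"threshold={t}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

&lt;p&gt;Raising the threshold trades recall for precision; pick the point that matches the cost of a false merge versus a missed duplicate in your pipeline.&lt;/p&gt;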

&lt;p&gt;Try GoldenMatch on your own data: &lt;code&gt;pip install goldenmatch&lt;/code&gt; or &lt;a href="https://dev.to/playground?tool=goldenmatch"&gt;try it in the playground&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://bensevern.dev/blog/2026-04-02-goldenmatch-bpid-benchmark" rel="noopener noreferrer"&gt;https://bensevern.dev&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
