DEV Community: benzsevern

Wagner Was on OFAC in 2018: What 10 Years of Sanctions Data Reveals

benzsevern — Thu, 16 Apr 2026 21:22:31 +0000

If you asked most people when the Wagner Group was first sanctioned by the United States, they'd say 2022 or 2023. That's the period the current narrative covers — the post-invasion wave of Russia-related designations, the Treasury press releases, the coordinated EU/UK/CA/AU announcements. It's also wrong by four years.

The first OFAC designation of "Private Military Company 'Wagner'" was on April 5, 2018, under the CAR (Central African Republic) and TCO (Transnational Criminal Organizations) programs, plus one called UKRAINE-EO13660 — a 2014-era authority that pre-dates the 2022 invasion by nearly a decade. I didn't know that going in; the dominant story about Russia sanctions skips over it. The pipeline I built to analyse sanctions data found it in about three seconds, because the pipeline doesn't read press releases — it reads dated XML snapshots of the SDN list itself.

This post is about what falls out when you point one entity-resolution pipeline at (a) every public sanctions list in the world, (b) a decade of historical OFAC snapshots from the Wayback Machine, and (c) a 13-million-wallet blockchain attribution graph, and ask it all the questions that compliance vendors don't want to answer. Six findings surface, four of which change how I'd think about screening coverage. This is a follow-up to last week's OSS vulnerability reconciliation post; same pipeline shape, completely different domain.

The three datasets

Three public data sources, stitched together on OpenSanctions' canonical entity IDs:

OpenSanctions sanctions collection — 278,647 FtM entities spanning 85 distinct public sanctions lists across 50+ jurisdictions. After filtering to real sanctionable entity schemas (Person, Organization, Company, LegalEntity, Vessel, Airplane, CryptoWallet), the universe is 72,326 canonical sanctioned entities. OpenSanctions has already resolved cross-list duplicates, so os_id is ground truth.
OpenSanctions peps collection — 708,385 politically-exposed persons across Wikidata, national legislature sources, and curated PEP datasets. Joins cleanly to the sanctions set by os_id.
Wayback Machine OFAC SDN snapshots — 38 usable distinct-digest captures of treasury.gov/ofac/downloads/sdn.xml between 2014 and 2023. 15,438 distinct UIDs ever seen across a decade.
Wallet-attribution graph — ~12.6 million Ethereum / Tron / Bitcoin addresses with public labels (Etherscan tags, Forta threat intelligence, DeFiLlama protocol IDs, Scamsniffer blacklists, FBI Lazarus disclosures) from my prior post.

Everything is free, permissively licensed, and bulk-downloadable. No paid API, no commercial compliance vendor, no proprietary chain analytics.

Finding 1: Wagner was on OFAC four years before the current narrative admits

The 38 Wayback snapshots of the SDN list let me compute a first-listing date for every UID. A substring search for "Wagner" returns exactly one hit:

2018-04-05  PRIVATE MILITARY COMPANY 'WAGNER'     progs=CAR,TCO,UKRAINE-EO13660

UID assigned in the April 2018 snapshot, under three separate authorities. UKRAINE-EO13660 dates to 2014 and was built for Crimea-annexation response — Wagner was pulled in under it because the unit was active in eastern Ukraine even then. CAR covered their operations in Central African Republic. TCO was the "this is a crime ring in all but name" framing.

For comparison, Vladimir Putin's personal OFAC entry dates to February 26, 2022 — two days after the invasion. That's also visible in the snapshot timeline. He wasn't on OFAC SDN before that; Wagner was on OFAC for four years before he was.

What this changes: a lot of the Russia-sanctions coverage treats "unprecedented" and "since 2022" as nearly interchangeable. The data says the authorities were already in place, parts of the Russian state-military-crime apparatus were already designated, and the post-2022 wave expanded an existing framework rather than built it from scratch. This is a more boring story but a more accurate one, and it's visible in public data if you parse the XML instead of the press releases.

Finding 2: 1 in 5 OFAC designations gets reversed

The decade of snapshots shows a shape most people don't associate with sanctions: sanctions regimes are not monotonic. Entities get added, but they also get delisted.

Snapshot	UIDs on SDN
2014-03-29	5,819
2016-04-02	5,266 (Iran JCPOA delistings)
2018-04-05	6,256
2019-08-05	7,737
2022-03-02	9,226
2022-08-06	10,489 (post-invasion wave)
2023-06-23	12,635

Distinct UIDs ever seen across the decade: 15,438. Current count: ~12,635. Delisted: 2,803 — 18.2%.

Almost one in five entities that was ever on the SDN has been removed. That happens for three main reasons: JCPOA-style diplomatic deals (the 2016 drop of ~600 Iran-related entries is visible as a cliff in the timeline), administrative restructuring (Tornado Cash's UID was retired and re-issued after a program-basis revision in 2022–2023), and sanctions-lift decisions where the underlying conduct stopped or was judged to be addressed.

What this changes: "being on the SDN" is not a life sentence in the data. The compliance industry does a lot of false-positive-weary work based on the assumption that OFAC designations are permanent records — and they are, in the sense that historical presence creates a lasting paper trail. But presence on the current list is a 1-in-5-you'll-be-removed-eventually proposition at the decade scale. That's not what most compliance training material implies.

The other shape in the delisting data worth noting: the 2022 Ukraine wave was the single largest multi-month expansion in the decade of snapshots. +1,263 UIDs between January and August 2022. The program code RUSSIA-EO14024 alone accounts for 2,588 designations since 2020 — more than any other OFAC program in recent memory. The scale of the 2022 expansion is unambiguous; the claim that it was built from scratch is the part that Finding 1 pushes back on.

Finding 3: Sanctioned crypto wallets are almost entirely new public intelligence

I joined the 2,461 publicly-sanctioned crypto-wallet entities from the sanctions universe against the 13M-address wallet-attribution graph. After excluding sanctions-derived labels (the OFAC list appears inside the attribution graph, so including it would be a tautology — every OFAC-sanctioned address would trivially match "attributed by OFAC"), the comparison is simple: of the 1,454 distinct sanctioned wallet addresses, how many carried a non-sanctions public label in Etherscan, Forta, DeFiLlama, exchange-attribution datasets, or the community-curated blacklists?

50 addresses. 3.4%.

Per-list breakdown after filtering out the tautological labels:

Sanctioning authority	Sanctioned wallets	Had prior external attribution
US OFAC SDN	772	50 (6.5%)
Israel MoD Crypto	685	0 (0.0%)
Japan MOF	26	9 (34.6%)
France Trésor	6	0

The OFAC overlap is mostly what you'd guess — Hydra Market, Blender.io, Chatex, Suex OTC all had prior Etherscan tags because these were darknet / mixer operations that blockchain observers had been tracking for years before formal designation. Those are the 50. The other ~720 OFAC-designated addresses had zero non-sanctions public labelling before OFAC added them.

Israel MoD's 685 wallets — almost all Tron-based, tied to Hamas financial infrastructure — had zero prior entries in any of the 10 wallet-attribution sources. Forta didn't flag them. Etherscan didn't tag them. Scamsniffer didn't blacklist them. The Israeli designation is 100% novel public intelligence for these addresses; every one of them was invisible to open-source chain analytics before the MoD published the list.

Japan MOF's 34.6% hit rate is a different phenomenon — most of the Japanese designations are mirrors of OFAC, so the prior-attribution there is shared OFAC lineage filtered through one additional authority.

What this changes: "blockchain is public, so chain analytics knows everything anyway" is a popular framing among people who think sanctions screening is an obsolete layer in crypto. The data says the opposite. Formal sanctions designations are adding new information to the public attribution graph in 97% of cases — information that the chain-analytics firms may have had internally but that no open-source database carried. If you run a crypto on-ramp or custody product, your sanctions screen is importing public intelligence you could not have derived from any other free source.

Finding 4: 63.9% of sanctioned entities are on exactly one list

This is the headline coverage number. The universe is 72,326 canonical entities; 46,219 of them (63.9%) appear on exactly one public sanctions list. The Big-3 Western stack (OFAC SDN + UK FCDO + EU Financial Sanctions) collectively covers 33.2% of the universe. Adding Canada, Australia, Japan, and Switzerland brings coverage to 35.5% — a 2.3-point improvement for four additional integrations.

Lists an entity appears on	Entities	Share
1	46,219	63.9%
2	10,483	14.5%
3	5,608	7.8%
4–7	5,112	7.1%
8+	4,683	6.5%

One-way coverage gaps are sharper than the averages suggest: OFAC-SDN-only screening misses 46.7% of EU Financial Sanctions designations and 46.1% of UK FCDO designations. That goes the other direction too — EU-only misses 84.2% of OFAC SDN, and UN Security Council coverage alone misses 95.8% of OFAC SDN, because OFAC designates hugely more unilaterally than the UN authorises.

This is the least structural of the findings and the most directly actionable. A compliance team buying "comprehensive sanctions screening" should ask which of the 85 public lists the vendor consumes — and should expect the honest answer to cover somewhere between 5 and 12, not 85.

Finding 5: PEP screening is a much weaker leading indicator than vendors suggest

One common compliance-industry claim is that aggressive PEP screening catches sanctions risk early — the theory being that politically-exposed persons are the pool from which future sanctioned individuals are drawn. I joined OpenSanctions' 708k-entity PEPs dataset against the 40,076 sanctioned persons in the sanctions universe.

Only 8.5% of sanctioned persons were in the PEPs dataset first. 3,419 of 40,076.

But the variance across individual sanctions lists is enormous and tells a more interesting story:

Sanctions list	Sanctioned persons	Also PEPs	PEP coverage
New Zealand Russia	1,352	870	64.3%
Australia DFAT	2,497	1,038	41.6%
Canada SEMA	3,457	1,330	38.5%
Japan MOF	1,970	668	33.9%
Switzerland SECO	3,721	1,253	33.7%
EU Travel Bans	3,829	1,267	33.1%
UK FCDO	3,853	1,221	31.7%
EU Financial Sanctions	4,319	1,315	30.4%
US OFAC SDN	7,384	1,394	18.9%
Ukraine NSDC	13,187	2,155	16.3%
Iraq AML	6,201	0	0.0%
Pakistan Proscribed	4,277	1	0.0%

What the table is actually showing: each sanctions regime has a characteristic target type, and PEP-overlap is a clean signature of it.

EU-cluster + Canada + Japan + Australia + NZ: 30–64%. These regimes disproportionately sanction current or former political figures — Russian officials, Belarusian ministers, Syrian regime members, Venezuelan ministers. PEP screening catches most of these in advance because "politician" and "politician's deputy" are exactly who PEP datasets track.
US OFAC SDN: 18.9%. OFAC sanctions a much broader spectrum — narcotics, transnational crime, cyber, proliferation, Cuban parastatals — most of whom are not politicians.
Iraq AML and Pakistan Proscribed Persons: 0%. Terror-focused lists. Terrorists don't hold public office.

What this changes: the "PEP screening is a leading indicator of sanctions risk" framing is true for some regimes and false for others, and the variance is a four-order-of-magnitude spread. If your compliance program's PEP screen is tuned for EU-style political designations (which most vendors default to), it will look strong against Canadian or Swiss sanctions and collapse to near-zero leading-indicator value against OFAC's transnational-crime designations or Pakistan's terror list. A flat "X% of sanctioned individuals were PEPs first" average hides this entirely.

Finding 6: Russia and the West sanction disjoint populations

One small finding worth flagging as-is: the ru_mfa_sanctions dataset — Russia's own Foreign Ministry counter-sanctions list — carries 2,163 entities, mostly Western politicians, journalists, officials. Zero of them appear on any Western sanctions list.

This is structurally the cleanest data point in the post. The West sanctions Russian entities; Russia sanctions Western ones; neither side sanctions its own nationals. The intersection is empty. Which sounds obvious in hindsight but is also a concrete illustration of the way sanctions regimes target completely disjoint populations — and a reminder that if your firm has exposure to Russian markets, screening only against Western lists misses an entire parallel regime.

Honest limitations

As with the prior two posts in this series, what I can't support matters as much as what I can:

Wayback snapshots are uneven. 67 distinct-digest captures between 2014 and 2023, but 29 of them were truncated or partially-archived and had to be discarded. The 38 usable snapshots span a decade but the intervals are irregular. A UID "first seen in 2018" may have actually been designated in 2017 if there's no 2017 snapshot; this systematically biases first-listing dates later.
Wagner's 2018 program codes don't equal Wagner's 2018 operational scope. OFAC can designate under historical authorities years after relevant conduct started. The April 2018 listing doesn't mean Wagner was first noticed in 2018 — it means the formal SDN designation happened then.
"OpenSanctions is ground truth." The canonical os_id is theirs, not mine. Their ER is audit-traced but not zero-error; a false merge or false split propagates through all the overlap stats.
PEP datasets are Wikidata-heavy. Of the 3,419 sanctioned-PEP overlaps, 95.6% come from wd_categories and 71.4% from wd_peps. Wikidata PEP coverage is uneven by country — Western European politicians are over-represented relative to Central Asian or African ones, which is one reason the NZ Russia list hits 64% overlap while Iraq AML hits 0%.
Crypto "prior attribution" is Ethereum-biased. 9 of 10 label sources in the 13M-wallet graph are Ethereum-specific. Bitcoin-addressed OFAC designations (Hydra, Blender, DPRK-related wallets) are undercounted relative to their real prior public coverage; the 3.4% number is a lower bound, not an upper one.
The 18.2% delisting rate is across the decade, not per year. Annual turnover is smaller — somewhere in the 2–4% range. The decade-scale figure is the right one to cite for "how often do designations get reversed?", but it's not "one in five designations is reversed each year."
No commercial database comparison. Dow Jones, World-Check, LexisNexis, and several chain-analytics firms maintain richer sanctions and PEP datasets. None are bulk-downloadable without a paid plan. This is a free-public-data analysis.

Takeaways

Wagner was first OFAC-listed in April 2018 under CAR / TCO / UKRAINE-EO13660 programs — four years before the current Russia-sanctions narrative starts. The post-2022 wave expanded an existing framework rather than built it.
18.2% of OFAC SDN's decade-history designations have been reversed. 2,803 of 15,438 ever-listed UIDs are not on the current list. Sanctions aren't permanent at the decade scale.
Sanctioned crypto wallets are 97% novel public intelligence. Only 3.4% had prior external attribution in a 13M-address graph. Israel MoD Crypto: 0% prior attribution; every one of the 685 addresses was invisible before the MoD published the list.
63.9% of sanctioned entities are on exactly one of 85 public lists. The Big-3 Western stack covers 33.2% of the universe. Adding four more Western regimes (CA/AU/JP/CH) adds only 2.3 points.
PEP screening is a 0-to-64% leading indicator depending on the sanctions regime. Clustered by target type: political-figure-focused regimes (EU, Canada, NZ) overlap with PEPs 30–64%; OFAC overlaps at 18.9%; terror-focused lists (Iraq AML, Pakistan Proscribed) overlap at zero.
Russia's counter-sanctions list (2,163 entities) has zero overlap with any Western sanctions list. Mirror-image regimes targeting disjoint populations.

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-sanctions-reconciliation. Five commands from a fresh clone:

python fetch_public_data.py      # OpenSanctions sanctions + peps + default (~3.5 GB, ~5 min)
python extract_records.py        # filter sanctions → parquet
python analyze.py                # cross-list coverage + pairwise overlap + famous cases
python analyze_peps.py           # sanctioned persons ∩ PEPs
python fetch_wayback_ofac.py     # ~70 Wayback SDN snapshots (~3 min with rate-limit courtesy)
python analyze_history.py        # first-listing dates + growth + delisting
python analyze_crypto.py         # cross-reference against wallet-attribution graph

The raw OpenSanctions data is CC-BY 4.0; Wayback snapshots are web.archive.org captures of treasury.gov/ofac/downloads/sdn.xml (U.S. government public domain); wallet-attribution data is permissively licensed. No API keys, no auth. Outputs land in output/ — report.json for the headline coverage numbers, history_report.json for the decade-of-OFAC findings, peps_report.json for the PEP-overlap table, crypto_report.json for the wallet-attribution cross-reference.

Companion repos, same ER pipeline, completely different domains: goldenmatch-vuln-attribution (869k OSS vuln records, 608k canonical clusters) and goldenmatch-wallet-attribution (13.1M blockchain attribution records). Three posts, three domains, one conceptual pipeline.

Install GoldenMatch: pip install goldenmatch. Star the repo: benzsevern/goldenmatch. Try the playground: bensevern.dev/playground.

Reproducibility footer.

Source datasets: OpenSanctions sanctions bulk collection (FtM JSON, snapshot dated 2026-04-16, 318 MB), OpenSanctions peps bulk (832 MB), Wayback Machine captures of treasury.gov/ofac/downloads/sdn.xml (67 distinct-digest snapshots 2014-03-29 → 2024-03-29, 38 usable), wallet-attribution graph from goldenmatch-wallet-attribution (10 label sources, 12.6M unique addresses post-normalisation).
Canonical sanctioned entities post-schema-filter: 72,326 (Person 40,076; Organization 17,088; LegalEntity 7,633; Company 2,862; CryptoWallet 2,461; Vessel 1,864; Airplane 342).
Distinct public sanctions lists represented: 85.
Entities on exactly one list: 46,219 (63.9%).
OFAC SDN 2014–2023: 5,819 → 12,635 UIDs. 15,438 distinct UIDs ever seen; 2,803 delisted (18.2%).
Wagner Group first OFAC designation: 2018-04-05, programs CAR / TCO / UKRAINE-EO13660.
Putin first OFAC designation: 2022-02-26, program RUSSIA-EO14024.
Sanctioned persons: 40,076. ∩ PEPs: 3,419 (8.5%).
Sanctioned crypto wallets: 1,454 distinct addresses. With prior external (non-sanctions) attribution: 50 (3.4%). Israel MoD Crypto prior-attribution rate: 0.0%.
Russia MFA counter-sanctions: 2,163 entities, 0 overlap with Western Big-7.
Tools: polars 1.39, Python 3.12, standard-library xml.etree.ElementTree, urllib.request. goldenmatch 1.4.4 installed but not required — OpenSanctions handles cross-list ER.
Hardware: Windows laptop, 32 GB RAM. Full pipeline completes in under 5 minutes once data is local; the Wayback fetch takes longer due to rate-limit courtesy (~3 min with a 1.5 s delay between requests).
Code and raw outputs: benzsevern/goldenmatch-sanctions-reconciliation (MIT). Scripts: fetch_public_data.py, fetch_wayback_ofac.py, extract_records.py, analyze.py, analyze_peps.py, analyze_history.py, analyze_crypto.py. Headline JSONs in output/.
Data date: 2026-04-16.

Originally published at https://bensevern.dev

infermap Now Runs in TypeScript: Schema Mapping on the Edge

benzsevern — Mon, 13 Apr 2026 22:11:25 +0000

You get a CSV from a vendor. The columns are fname, lname, tel, addr1. Your database expects first_name, last_name, phone, street_address. You write a mapping dict, ship it, and move on — until the next vendor sends givenName, surname, mobile, address_line_1.

infermap solves this by inferring column mappings automatically. You hand it a source schema and a target schema, and it returns the best 1:1 field assignment with confidence scores and per-scorer reasoning. No hardcoded synonyms. No manual mapping files. It shipped on PyPI in March 2026 and has been running in production Python pipelines since.

Today it's available on npm as a full TypeScript port — feature parity with the Python version, same algorithm, same accuracy. This post covers what changed, what stayed the same, and why a TypeScript infermap opens doors that Python alone couldn't.

What infermap Does in 30 Seconds

infermap runs a multi-scorer pipeline to match source fields to target fields:

Schema extraction — reads columns from CSVs, JSON, database tables, or in-memory records. Infers dtypes, null rates, cardinality, and samples values.
Common-prefix stripping — if every source column starts with prospect_, it strips the prefix before matching so prospect_city matches city.
Score matrix — for each (source, target) pair, seven independent scorers produce a score and a reasoning string. Results are combined via weighted average (minimum 2 contributors).
Optimal assignment — the Hungarian algorithm finds the minimum-cost 1:1 matching across the full M x N matrix.
Confidence filtering — assignments below the threshold (default 0.2) are dropped.

The result: a list of mappings, each with a confidence score and explainable reasoning from every scorer that contributed.

The Seven Scorers

Scorer	Weight	What It Does
ExactScorer	1.0	Case-insensitive exact name match
AliasScorer	0.95	Known synonyms — `fname` to `first_name`, extensible via config and domain dictionaries
LLMScorer	0.8	Pluggable LLM-backed scoring (stubbed by default)
InitialismScorer	0.75	Matches abbreviations via dynamic programming — `ASSI` to `assay_id`, `CONSC` to `confidence_score`
PatternTypeScorer	0.7	Semantic type detection from sample values — email, UUID, ISO date, phone, URL, IP, ZIP, currency
ProfileScorer	0.5	Statistical similarity — dtype, null rate, unique rate, value length, cardinality
FuzzyNameScorer	0.4	Jaro-Winkler string similarity on normalized names

The combination of name-based, value-based, and statistical scorers means infermap handles cases where any single approach fails. A column named col_7 won't match on name — but if every value is an email address, PatternTypeScorer catches it. A column named patient_identifier won't fuzzy-match mrn — but AliasScorer with the healthcare domain dictionary will.

What the TypeScript Port Looks Like

import { map } from "infermap";

const result = map(
  { records: vendorData, sourceName: "vendor" },
  { records: internalSchema, sourceName: "internal" }
);

for (const m of result.mappings) {
  console.log(`${m.source} -> ${m.target}  (${(m.confidence * 100).toFixed(0)}%)`);
}
// fname -> first_name  (98%)
// lname -> last_name   (98%)
// tel   -> phone       (91%)
// addr1 -> street_address (76%)

The API mirrors Python's. If you've used infermap in a Python pipeline, the TypeScript version feels identical — same function names, same config structure, same output shape.

Database support

import { mapFromDb } from "infermap/node";

// Map columns between two Postgres tables
const result = await mapFromDb(
  { uri: "postgresql://localhost/warehouse", table: "raw_imports" },
  { uri: "postgresql://localhost/warehouse", table: "canonical_customers" }
);

SQLite, PostgreSQL, and DuckDB are supported as optional peer dependencies — install only what you need.

Custom scorers

import { defineScorer, makeScorerResult, MapEngine, defaultScorers } from "infermap";

const domainScorer = defineScorer(
  "DomainScorer",
  (source, target) => {
    if (source.name.includes("price") && target.name.includes("amount")) {
      return makeScorerResult(0.85, "price/amount semantic match");
    }
    return null; // abstain
  },
  0.7
);

const engine = new MapEngine({
  scorers: [...defaultScorers(), domainScorer],
});

Config persistence

// Save a mapping for reuse — no re-inference needed
result.saveConfig("vendor_to_internal.json");

// Later: load and apply
import { applyConfig } from "infermap";
const renamed = applyConfig(newData, "vendor_to_internal.json");

Same Algorithm, Same Accuracy

The Python and TypeScript versions share a 162-case benchmark corpus — 82 cases from the Valentine schema matching benchmark plus 80 synthetic cases. Both implementations produce results within 0.0005 F1 of each other on the shared corpus.

Metric	Python	TypeScript
Overall F1	0.840	0.840
Valentine corpus (82 cases)	0.794	0.794
ChEMBL subset (25 cases)	0.819	0.819
Calibrated ECE	0.005	0.005

The Hungarian algorithm implementation is vendored in TypeScript (O(n³) — no scipy dependency), and every scorer was ported line-by-line with matching test cases. 186 TypeScript tests verify parity.

Zero Runtime Dependencies in Core

The TypeScript core has zero runtime dependencies. No lodash, no heavy string libraries, no Node.js built-ins. The Jaro-Winkler implementation, Hungarian algorithm, and pattern matchers are all self-contained.

This matters because it means infermap runs anywhere JavaScript runs:

Next.js Edge Runtime — map schemas in middleware or edge API routes
Cloudflare Workers — schema mapping at the edge, sub-50ms cold starts
Vercel Edge Functions — inline mapping in serverless functions
Browser — map schemas client-side in a data import wizard
Deno / Bun — no Node.js-specific APIs in the critical path

Node.js-specific features (file system access, database providers) are isolated in the infermap/node entrypoint. The core infermap import is edge-safe.

Doors This Opens

A Python-only infermap was useful for batch ETL jobs and data pipelines. A TypeScript infermap changes what's architecturally possible.

Upload-time schema resolution

When a user uploads a CSV to your web app, you can map their columns to your schema before the data ever hits your backend. Run infermap client-side or in an edge function, show the user the proposed mapping with confidence scores, let them confirm or override, then send the already-mapped data to your API.

// In a Next.js API route or edge function
import { map } from "infermap";

export async function POST(req: Request) {
  const { headers, sample } = await req.json();

  const result = map(
    { records: sample, sourceName: "upload" },
    { fields: TARGET_SCHEMA, sourceName: "canonical" }
  );

  // Return proposed mapping for user confirmation
  return Response.json({
    mappings: result.mappings,
    unmapped: result.unmapped_source,
    confidence: result.mappings.map((m) => m.confidence),
  });
}

No round-trip to a Python service. No cold-starting a container. The mapping happens at the edge in milliseconds.

Full-stack type safety

With TypeScript, the mapping result is fully typed. Your IDE autocompletes field names, catches typos at compile time, and your CI pipeline verifies the mapping config matches your schema types. Python's infermap returns dicts — TypeScript's returns typed objects that integrate with your existing type system.

Monorepo workflows

If your backend is Node.js or TypeScript, infermap slots into your existing build pipeline. No Python runtime to install in CI. No virtualenv to manage. One npm install infermap and you're done.

Browser-based mapping UIs

The score matrix that infermap computes — the full M x N grid of (source, target, confidence) tuples — is exposed via the API. You can render it as an interactive mapping UI where users see every candidate match, the confidence score, and the per-scorer reasoning. The zero-dependency core means this runs in the browser without bundling a runtime.

Shared config between Python and TypeScript

infermap configs are serialized as JSON (TypeScript) or YAML (Python). A mapping you infer in a Python notebook can be loaded and applied in a TypeScript API route, and vice versa. Teams that use Python for data science and TypeScript for production APIs can share mapping definitions without translation.

What Stayed the Same

Everything that matters:

Seven scorers with the same weights and logic
Hungarian algorithm for optimal 1:1 assignment
Common-prefix canonicalization before matching
Domain dictionaries for healthcare, finance, and ecommerce aliases
Confidence calibration via Isotonic (PAV) and Platt (Nelder-Mead) calibrators
CLI with map, apply, inspect, and validate subcommands
Config persistence — save once, apply forever

Install and Try It

npm install infermap

Or with database support:

npm install infermap better-sqlite3  # SQLite
npm install infermap pg              # PostgreSQL
npm install infermap @duckdb/node-api # DuckDB

benzsevern / infermap

Inference-driven schema mapping engine for Python and TypeScript. 7 built-in scorers, domain dictionaries (healthcare/finance/ecommerce), confidence calibration, cross-language accuracy benchmark (F1 0.84), and full Python↔TypeScript parity.

infermap

Inference-driven schema mapping engine.
Map messy source columns to a known target schema — accurately, explainably, with zero config.
Built by Ben Severn.

📖 Wiki · 🌐 Docs · 🧪 Examples · 💬 Discussions · 🐛 Issues

infermap is a schema-mapping engine. Give it any two field collections (CSVs, DataFrames, database tables, in-memory records) and it figures out which source field corresponds to which target field, with confidence scores and human-readable reasoning. Available as a Python package on PyPI and a TypeScript package on npm, with mapping decisions verified bit-for-bit by a shared golden-test parity suite.

Install

Python

pip install infermap

Optional database extras:

pip install infermap[postgres]   # psycopg2-binary
pip install infermap[mysql]      # mysql-connector-python
pip install infermap[duckdb]     # duckdb
pip install infermap[all]        # all

…

View on GitHub

The Python version is still on PyPI (pip install infermap) and both are maintained in the same monorepo with shared golden tests.

Key Takeaways

infermap's seven-scorer schema mapping engine is now on npm with full feature parity to the Python version
The TypeScript core has zero runtime dependencies — it runs in Edge Functions, Workers, browsers, and Node.js
Both versions score F1 0.840 on 162 real-world benchmark cases, verified to within 0.0005 F1
Upload-time schema resolution, browser mapping UIs, and full-stack type safety are now architecturally possible
Mapping configs are portable between Python and TypeScript — same team, two runtimes, one source of truth

Schema mapping shouldn't require a Python service, a container, and a five-second cold start. Now it doesn't. npm install infermap and map your first schema in under a minute.

Reconciling 15 OSS Vulnerability Databases: What They Actually Cover

benzsevern — Thu, 09 Apr 2026 22:22:05 +0000

If you run an open source project, you probably rely on a vulnerability scanner that queries one or two databases. Dependabot looks at GitHub Security Advisories. pip-audit looks at PyPA. cargo audit looks at RustSec. Each tool has an opinion about what counts as a known vulnerability, and those opinions only partially overlap.

I wanted to know, concretely, what the overlap looks like. Not "Dependabot is good" or "OSV is comprehensive" — actual numbers. So I did the same thing I did last week for blockchain attribution data: pointed one entity-resolution pipeline at every public vulnerability database I could download for free and let the union-find speak.

The answer is 869,771 records across 15 sources, collapsing to 608,463 canonical vulnerabilities. That reconciliation surfaces three findings I did not go looking for, and one of them changed how I think about OSS dependency scanning.

The fifteen sources

Every one of these publishes bulk exports, under permissive licenses, without an API key:

Source	Records	What it covers
OSV.dev (10 ecosystem bulks)	519,760	PyPI, npm, Go, Maven, RubyGems, crates.io, Packagist, NuGet, Debian, Alpine
GitHub Advisory Database	350,164	28,618 reviewed + 297,078 unreviewed mirrors
PyPA advisory-database	3,230	Python Packaging Authority curated vulns
Go vulnerability DB	3,079	Go modules
RustSec advisory-db	1,022	Rust crates
EPSS	~326,000	Exploit prediction scores per CVE
Total records ingested	869,771

Two things to notice about this list. First, OSV and GHSA dominate — between them they account for 870k of the 870k. The smaller ecosystem-specific databases (PyPA, RustSec, Go vulndb) are curated subsets that cover at most a few thousand entries each but often with higher-quality metadata. Second, GHSA splits internally into "reviewed" (28k — the set GitHub's security team actually touches) and "unreviewed" (297k — a passthrough mirror of NVD filtered to packages GitHub tracks). That split is going to matter.

The schema and the join

I projected every source to a nine-column row:

vuln_id    aliases   ecosystem   package   purl   published   modified   severity   source

vuln_id is the primary identifier that source uses — a GHSA-xxxx, CVE-xxxx, PYSEC-xxxx, RUSTSEC-xxxx, GO-xxxx, or MAL-xxxx. aliases is a semicolon-joined list of cross-database identifiers the source knows about. purl is the Package URL — a canonical string like pkg:pypi/tensorflow or pkg:maven/io.grpc/grpc-protobuf that uniquely names a package across every public ecosystem.

The useful insight for the ER work is that OSV's aliases field is a partial ground truth for the reconciliation pipeline. An OSV entry for GHSA-gcx2-gvj7-pxv3 might say aliases: [CVE-2022-24766, PYSEC-2022-170]. A separate entry in the PyPA database for PYSEC-2022-170 says aliases: [GHSA-gcx2-gvj7-pxv3, CVE-2022-24766]. The alias graph is mostly pre-computed — the ER pipeline's job is to walk it transitively and catch the cases where it isn't.

That's a union-find. I pointed one at the (vuln_id, aliases) pair for every row:

parent: dict[str, str] = {}

def find(x: str) -> str:
    while parent.get(x, x) != x:
        parent[x] = parent.get(parent[x], parent[x])
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

for row in df.iter_rows(named=True):
    vid = row["vuln_id"]
    parent.setdefault(vid, vid)
    for a in row["aliases"].split(";"):
        a = a.strip()
        if a:
            parent.setdefault(a, a)
            union(vid, a)

Forty lines of code, finishes in under a second on 616,237 distinct identifiers. After the compaction pass the pipeline has 608,463 canonical vulnerability clusters. Of those, 345,568 (57%) collapsed two or more distinct identifiers — meaning more than half of every canonical vulnerability in the free public data carries a cross-database alias.

That's a much denser ER signal than the blockchain dataset from last week. The clusters are smaller on average (most have 2-3 IDs, not 10-45) but the ratio of "records that participate in multi-ID resolution" is dramatically higher. OSS security data is deliberately cross-linked; blockchain attribution data is not.

Finding 1: GitHub reviews 9.1% of what it ingests

Here is the headline number, and here is why I want to be careful about it.

Set	Canonical clusters
Full OSS vulnerability universe (union of all sources)	312,250
`github-reviewed` (GitHub security team curated)	28,419 (9.1%)
`github-unreviewed` (NVD mirror filtered to tracked packages)	297,076 (95.1%)
OSV across all ecosystems (any)	312,098 (99.95%)

9.1% is the percentage of the full free OSS vulnerability universe that ends up in GitHub's reviewed advisory set — the one the GitHub security team actually curates, enriches, and writes human-readable metadata for. The other 91% passes through GHSA as unreviewed CVE mirrors.

I want to flag this next part explicitly, because it is the kind of number that is easy to misrepresent. This is not "Dependabot misses 91% of vulnerabilities." Dependabot consumes both the reviewed and unreviewed GHSA sets, so in terms of raw ID awareness its coverage is much closer to the full universe. What the 91% number actually measures is the curation ratio: out of every hundred OSS vulnerability IDs that flow through GitHub's advisory pipeline, only about nine get the human review, the summary rewrite, the CWE assignment, the affected-versions normalization, the severity validation.

So the accurate framing is: most of what Dependabot shows you is passthrough data. Nine percent of it has been curated by a human on GitHub's security team. That's still interesting — most developers do not know their tool is 91% passthrough — but it is a statement about metadata quality, not a statement about coverage.

For the record: github-reviewed overlaps heavily with the per-ecosystem curated sets. PyPA, RustSec, and Go vulndb are all disjoint enrichment paths that contribute a few thousand high-quality entries each. If you point one tool at all of them, your curated coverage roughly doubles. If you point one tool at the whole public universe, your passthrough coverage goes to 99%. Most tools do neither.

Finding 2: The JavaScript ecosystem has more tracked vulnerabilities than everything else combined

Ecosystem	Canonical vulns	Ratio to npm
npm	217,162	1.00×
Debian (4 active releases combined)	~160,000	0.74×
PyPI	15,920	0.07×
Maven	6,370	0.03×
Packagist (PHP)	5,571	0.03×
Go	3,627	0.02×
Alpine (10+ versions combined)	~25,000	—
RubyGems	1,988	0.009×
NuGet (.NET)	1,653	0.008×
crates.io	1,396	0.006×

npm has 14× more tracked vulnerabilities than PyPI and 131× more than NuGet. I want to be careful here too. There are at least three reasonable explanations for why these numbers look the way they do, and the data cannot distinguish between them:

npm has a much larger surface area. The JavaScript ecosystem has more packages, more transitive dependencies per package, more maintainers, and more velocity. A bigger numerator is expected.
npm gets much more adversarial attention. Typo-squatting campaigns, malicious packages, and coordinated supply chain attacks target npm disproportionately because it's where the blast radius is largest. More attention finds more bugs.
Other ecosystems get less scrutiny. NuGet has 1,653 reported vulnerabilities across all of public .NET. That number is suspiciously small for an ecosystem that has run enterprise backends for two decades. Either .NET is miraculously clean or nobody is looking.

The honest read is that all three are partly true. The 130× gap between npm and NuGet is not a claim that npm is 130× less safe — it is a claim that the free public vulnerability-visibility stack is 130× more attentive to npm. If you are a .NET developer relying entirely on free tools, your observable attack surface is smaller than your actual one.

Finding 3: The free OSS stack is structurally blind to system-level vulnerabilities

This is the finding I did not go looking for, and it is the one that will stick with me. I wrote a small section in the analyzer that looks up half a dozen famous vulnerabilities by CVE ID and dumps the cluster they resolve to:

famous = {
    "Log4Shell":    "CVE-2021-44228",
    "Spring4Shell": "CVE-2022-22965",
    "Heartbleed":   "CVE-2014-0160",
    "Shellshock":   "CVE-2014-6271",
    "ProxyShell":   "CVE-2021-34473",
    "ZipSlip":      "CVE-2018-1002105",
}

Half of these resolve beautifully:

Vuln	Cluster sources	Ecosystems	Affected packages
Log4Shell	ghsa-reviewed + osv-Maven	Maven	5 log4j-derivative packages
Spring4Shell	ghsa-reviewed + osv-Maven	Maven	5 Spring packages
ZipSlip	ghsa-reviewed + go-vulndb + osv-Go	Go	`github.com/kubernetes/kubernetes`

Log4Shell's cluster correctly identifies org.apache.logging.log4j:log4j-core plus four derivative wrappers (com.guicedee.services:log4j-core, org.ops4j.pax.logging:pax-logging-log4j2, etc.). If you were writing a Maven SBOM scanner, the ER pipeline has just done most of your work.

The other three resolve to nothing:

Vuln	Cluster sources	Ecosystems	Affected packages
Heartbleed (CVE-2014-0160)	ghsa-unreviewed only	none	none
Shellshock (CVE-2014-6271)	ghsa-unreviewed only	none	none
ProxyShell (CVE-2021-34473)	ghsa-unreviewed only	none	none

Heartbleed is in the data. It has a CVE ID. It exists in the GHSA unreviewed mirror. But its cluster has no ecosystem tag and no affected package. None of the curated sources — not PyPA, not RustSec, not Go vulndb, not any OSV ecosystem bucket — has Heartbleed attached to a single package. Same story for Shellshock. Same story for ProxyShell.

Why? Because OpenSSL, bash, and Microsoft Exchange Server are not distributed through managed package ecosystems. OpenSSL ships as a C library bundled into operating system images, container base layers, Python wheels via cryptography, Node.js builds, and about a thousand other places that do not go through npm or PyPI. Bash ships as a distro package. Exchange ships as an installer. None of them have a PURL. None of them have a declarable version range in a requirements.txt. Package-level scanners cannot see them by construction.

This is a structural property of how the free OSS vulnerability tooling stack is wired. The scanners that developers actually run — Dependabot, pip-audit, cargo audit, npm audit, Snyk's free tier — all resolve vulnerabilities against package manifests. If the vulnerability is in a system library, the manifest does not reference it, and the scanner is silent.

The next Heartbleed will not be detected by any of these tools. Not because the databases don't know about it — Heartbleed itself is in all of them — but because the thing doing the matching is asking the wrong question. It's asking "which of my declared packages is affected?" when it should be asking "which of the binaries actually installed on this machine is affected?" That is a completely different pipeline, and it lives in tools like Trivy, Grype, and Syft that do container image scanning. Most developers do not run those tools.

I did not expect ER to find this. I was looking for cross-database name disagreements and got handed a structural blind spot instead. The entity-resolution pipeline made it obvious because it projects every source to the same (ecosystem, package) key — and when Heartbleed consistently projects to (none, none), the null result is loud.

What else is in the data

A few secondary findings that do not need their own sections:

The highest-ID-count clusters are Bitnami container fanout. The top of the disagreement list is dominated by entries like GHSA-4xp2-w642-7mcx, which has ten IDs: BIT-cilium-2023-41333, BIT-cilium-operator-2023-41333, BIT-cilium-proxy-2023-41333, BIT-hubble-2023-41333, BIT-hubble-relay-2023-41333, BIT-hubble-ui-2023-41333, plus the root GHSA and CVE. Bitnami's scanner emits one BIT-prefixed identifier per container variant of the same underlying vulnerability. The union-find correctly collapses these, which is a legitimate ER outcome, but it is not the dramatic cross-database name disagreement I was hoping for. The real story is boring: OSV has a known vuln, six Bitnami container images inherit it, and the ID-per-container convention inflates the count.

Cross-ecosystem misfiling exists in the raw data. While sampling OSV's PyPI ecosystem dump I found GHSA-cfgp-2977-2fmm — filed in the PyPI directory, but its only affected package is pkg:maven/io.grpc/grpc-protobuf, a Java gRPC library. If you filter OSV by directory name instead of by PURL, you silently lose vulnerabilities to misfiling. The ER pipeline catches this automatically because it joins on PURL, not on directory.

EPSS does not change the coverage story. Every CVE has an EPSS exploit-prediction score (326k of them), and I pulled the dataset hoping to find that high-EPSS vulns are better covered across databases than low-EPSS ones. They are not, meaningfully. Coverage is a function of which ecosystem the package lives in, not how exploitable the vuln is. That is its own kind of finding but does not carry a post on its own.

Honest limitations

I want to be precise about what this analysis is and isn't:

No NVD direct ingestion. I pulled NVD via its propagation into GHSA-unreviewed and OSV rather than hitting the REST API directly. That covers most OSS-ecosystem packages but does miss NVD entries that never made it into either mirror. Adding NVD as a 16th source would expose the "pure NVD coverage gap" question but take ~15 minutes of paginated fetching.
Union-find on literal IDs. Case-insensitive normalization is not applied. In practice OSV, GHSA, and the curated sources are consistent about identifier format, but this is worth stating.
Row counts are not vuln counts. One advisory that affects three packages emits three rows. The canonical-cluster numbers in this post are distinct counts after ER, not raw rows. Both are in output/report.json.
No version-range normalization. The ER pipeline joins on the (vuln_id, alias) graph, not on affected versions. This is sufficient for "which databases know about this vulnerability," but not for "is the specific version I have installed affected." Those are different questions and need different pipelines.
No commercial database comparison. Snyk, Sonatype, Chainguard, Anchore, and JFrog all maintain databases that are richer than anything in this post. None of them are bulk-downloadable without a paid plan. The story here is specifically about the free tier, which is what most individual developers actually use.
"Blind spot" is strong language. The free OSS tooling stack is blind to Heartbleed-class vulnerabilities when invoked as a package-level scanner. Container scanners like Trivy, Grype, and Syft do look at system libraries. The blind spot is at the specific layer most developers interact with — dependabot or pip-audit on a repo — not at the whole ecosystem.

Takeaways

15 free public databases, 869,771 records, 608,463 canonical vulnerabilities after union-find on the cross-database alias graph.
GitHub Security Advisories reviews about 9.1% of what it ingests. Most of what Dependabot surfaces is passthrough NVD data with no curation, no CWE assignment, and no human review. Developers do not usually know this.
The JavaScript ecosystem has 14× more tracked vulnerabilities than Python and 131× more than .NET. The data cannot tell you whether that is attention, scrutiny, or real exposure — but the asymmetry itself is measured.
Package-level vulnerability scanners cannot see Heartbleed, Shellshock, or ProxyShell. Not because the databases don't know — they do — but because these vulnerabilities live in system software with no PURL and no declarable dependency. The free OSS stack is structurally blind to this class by construction. If you care about system-library vulns, run a container scanner.
Entity resolution is the right tool for this question. Union-find on the alias graph collapses 57% of canonical vulnerabilities across cross-database identifiers, producing a unified view that no single tool gives you. The blockchain post from last week established the same pattern for a completely different domain; the pipeline is domain-agnostic.

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-vuln-attribution. Four commands from a fresh clone:

python fetch_public_data.py     # ~600 MB download, ~5 min
python count_sources.py         # diagnostic row count, optional
python extract_records.py       # sources → single parquet (~30 sec)
python analyze.py               # union-find ER + findings

All six data sources are permissively licensed and redistributable. No API keys. No auth. The full 869k-row analysis finishes in under a minute once the data is local. Outputs land in output/ — report.json for the headline numbers, famous_vulns.json for the Log4Shell/Heartbleed/Shellshock clusters, top_disagreement.json for the Bitnami fanout examples.

If you want to see the same ER pattern applied to a completely different domain, the companion repo is benzsevern/goldenmatch-wallet-attribution — 13.1 million blockchain attribution records reconciled the same way. Both posts use the same library (GoldenMatch) and the same conceptual pipeline; only the data changes.

Install GoldenMatch: pip install goldenmatch. Star the repo: benzsevern/goldenmatch. Try the playground: bensevern.dev/playground.

Reproducibility footer.

Source datasets: OSV.dev bulk exports (osv-vulnerabilities.storage.googleapis.com, 10 ecosystems), github/advisory-database main branch, pypa/advisory-database main, rustsec/advisory-db main, golang/vulndb master, EPSS current scores (epss.empiricalsecurity.com).
Total download: ~600 MB of zip archives, read in place via zipfile.ZipFile (no extraction — NTFS cluster overhead blows up millions of tiny JSON files by two orders of magnitude).
Input rows: 869,771 across 15 sources.
Unique vuln_ids: 616,237.
Canonical vulnerabilities post-ER: 608,463. Clusters with 2+ IDs: 345,568. Full OSS universe: 312,250.
github-reviewed share of full universe: 9.1% (28,419 / 312,250).
Tools: goldenmatch 1.4.4 (conceptual reference, pipeline is union-find + polars for the scale-up), polars 1.39, pyyaml 6.0, Python 3.12.
Hardware: Windows laptop, 32 GB RAM. Full pipeline completes in under 90 seconds once data is local.
Code and raw outputs: benzsevern/goldenmatch-vuln-attribution (MIT). Scripts: fetch_public_data.py, count_sources.py, extract_records.py, analyze.py. Headline JSON: output/report.json.
Data date: 2026-04-10.

Originally published at https://bensevern.dev

Wallet Attribution at Scale: ER on 13M Blockchain Records

benzsevern — Thu, 09 Apr 2026 18:32:45 +0000

Every public blockchain attribution dataset is a partial, opinionated view of the same underlying reality. OFAC publishes ~800 sanctioned crypto wallets. Etherscan crowdsources ~50,000 tags across seven EVM chains. Sourcify holds ~14 million verified contract deployments. Forta tracks known-malicious contracts. DeFiLlama catalogs protocol addresses. Israel's Ministry of Defense and the FBI's Lazarus unit each publish their own targeted lists. None of them agree, none of them are complete, and almost none of them talk to each other.

I wanted to know what happens if you reconcile all of them. Not with a custom schema, not with hand-written joins — with a single entity resolution pipeline pointed at every public source I could find. The answer is 13,147,920 input rows, 30,958 multi-source clusters, and three findings I could not have produced at smaller scale.

The ten sources

I pulled every freely redistributable blockchain attribution dataset I could verify:

#	Source	Rows	What it covers
1	OFAC SDN Enhanced XML	788	US Treasury sanctioned wallets, 18 chains
2	brianleect/etherscan-labels	52,773	Crowdsourced Etherscan tags, 7 EVM chains
3	dawsbot/eth-labels	17,495	Curated Ethereum mainnet categories
4	Sourcify parquet exports	13,062,088	Verified contract deployments, all chains
5	Forta labelled-datasets	7,480	Known malicious contracts + phishing
6	DeFiLlama protocols	3,332	Protocol contract addresses
7	ScamSniffer blacklist	2,530	Reported scam addresses
8	ethereum-lists	717	Dark/light address lists
9	OpenSanctions: il_mod_crypto	684	Israel MoD sanctioned wallets
10	OpenSanctions: us_fbi_lazarus_crypto	33	FBI Lazarus Group wallets
	Total	13,147,920

Sourcify dominates by two orders of magnitude. Everything else is a long tail of curated, opinionated, high-signal labels. That asymmetry shapes the whole story: Sourcify tells you what addresses exist, the other nine tell you what they mean, and entity resolution is what turns one into the other.

The schema

Every source projects to five columns:

COMMON_COLS = ["address_norm", "address_raw", "entity_name", "label", "source"]

address_norm is the joining key: lowercase, 0x prefix stripped.
address_raw keeps the original format for display.
entity_name is whatever the source calls it ("LAZARUS GROUP", "Safe: Proxy Factory 1.3.0", "").
label is the source-specific tag, namespaced (etherscan:etherscan:ofac-sanctions-lists).
source is the dataset identifier.

No fuzzy matching on names. Names disagree too often to be a primary signal — that's actually the central finding. The only reliable join is address_norm.

Running GoldenMatch at 535k

I started at a sensible scale: five sources plus Sourcify's Ethereum mainnet subset, 535,336 rows, and a direct call to goldenmatch.dedupe:


result = gm.dedupe(
    str(STAGED / "01_ofac.csv"),
    str(STAGED / "02_etherscan_labels.csv"),
    str(STAGED / "03_eth_labels.csv"),
    str(STAGED / "04_ethereum_lists.csv"),
    str(STAGED / "05_defillama.csv"),
    str(STAGED / "06_sourcify.csv"),
    exact=["address_norm"],
    blocking=["address_norm"],
)

This finished in about 40 seconds on a Windows laptop, found 12,640 multi-member clusters, and auto-fixed 51 data quality issues in the raw public sources (smart quotes, invisible characters, stray whitespace) before matching. GoldenCheck's quality scanner is bundled into the dedupe call — you don't ask for it, it just happens.

The results at 535k surfaced the best single anecdote in the whole exercise:

Address	OFAC name	Etherscan name
`0x098B716B8Aaf21512996dC57EB0615e2383E2f96`	LAZARUS GROUP	Ronin Bridge Exploiter
`0x5f48c2a71b2cc96e3f0ccae4e39318ff0dc375b2`	SEMENOV, Roman	Tornado.Cash: Team 1 Vesting

The first row is the Axie Infinity Ronin Bridge wallet — the address behind the $625 million Lazarus Group hack, labeled by OFAC as "LAZARUS GROUP" and by Etherscan as "Ronin Bridge Exploiter." Two correct names, completely unrelated strings. A name-based join finds nothing. An address-normalized join finds the link instantly. The second row ties a sanctioned Tornado Cash co-founder to a specific named vesting contract. If you took only one thing from this post, take that: names disagree, addresses don't.

Scaling to 13 million

The 535k run validated the pipeline. I wanted to know what happened at the real ceiling of free public data. That meant pulling all 14 Sourcify deployment parquets (one per million contracts, ~2 GB total) covering every chain Sourcify tracks — not just Ethereum mainnet.

# fetch_all_sourcify.py — parallel download of 14 parquets
with ThreadPoolExecutor(max_workers=4) as ex:
    for fut in as_completed([ex.submit(fetch_parquet, f) for f in files]):
        fut.result()

And then the staging step streams them directly into the common schema via Polars without ever materializing the full frame:

for pq in parquets:
    df = pl.read_parquet(pq, columns=["chain_id", "address"])
    out = df.select([
        pl.col("address").bin.encode("hex").alias("address_norm"),
        ("0x" + pl.col("address").bin.encode("hex")).alias("address_raw"),
        pl.lit("").alias("entity_name"),
        ("sourcify:chain_" + pl.col("chain_id").cast(pl.Utf8)).alias("label"),
        pl.lit("sourcify").alias("source"),
    ])
    out.write_csv(f, include_header=False)

After staging, the full dataset is 13,147,920 rows across 10 sources.

The honest caveat

At 13M rows, calling goldenmatch.dedupe crashes at the cluster-materialization step with a MemoryError in the Python dict build-out. That's not a GoldenMatch bug — it's pure Python object overhead on 12 million unique cluster keys. Since the full pipeline was already reducing to exact-match-on-address_norm blocking (names disagree too much to fuzzy on), the operation is mathematically equivalent to a polars groupby. I wrote that directly:

all_df = pl.concat([pl.scan_csv(p) for p in STAGED.glob("*.csv")]).collect()

clusters = (
    all_df.group_by("address_norm")
    .agg([
        pl.col("source").unique().alias("sources"),
        pl.col("source").n_unique().alias("n_sources"),
        pl.col("source").len().alias("size"),
        pl.col("entity_name").filter(pl.col("entity_name") != "").alias("entities"),
        pl.col("label"),
    ])
)

Same logical result, fits in memory, runs in about 30 seconds. The 535k run proves the ER pipeline works end-to-end with GoldenMatch's full feature set (fuzzy scorers, blocking strategies, lineage, golden records). The 13M run uses GoldenMatch's auto-config decisions as the template but delegates the exact-match groupby to Polars because Python dicts are the wrong data structure at that volume. I want to be upfront about that — the scale-up is not an endorsement of "GoldenMatch scales to 13M natively," it's an endorsement of "GoldenMatch chose the right blocking strategy at 535k, and that strategy is trivially reproducible at 13M in a columnar engine."

What the 13M run surfaced

1. Nine wallets cross-sanctioned by two governments

This is the headline finding and only possible because I had two independent sanctions sources. Nine crypto wallets appear on both the US Treasury OFAC list and Israel's MoD sanctioned crypto list:

Wallet	Entity	Chain
`TCzq6m2zxnQkrZrf8cqYcK6bbXQYAfWYKC`	ZEDCEX EXCHANGE LTD	Tron
`TGsNFrgWfbGN2gX25Wcf8oTejtxtQkvmEx`	ZEDCEX EXCHANGE LTD	Tron
`TTS9o5KkpGgH8cK9LofLmMAPYb5zfQvSNa`	ZEDCEX EXCHANGE LTD	Tron
`TNuA5CQ6LB4jTHoNrjEeQZJmcmhQuHMbQ7`	ZEDCEX EXCHANGE LTD	Tron
`TLvuvpfBKdxddxSsJefeiGCe9eVY8HUroE`	ZEDCEX EXCHANGE LTD	Tron
`TWBAPzpPiZarfVsY2BLXeaLhNHurn4wkWG`	AL-LAW, Tawfiq Muhammad Sa'id	Tron
`0x175d44451403Edf28469dF03A9280c1197ADb92c`	GAZA NOW	Ethereum
`0x21B8d56BDA776bbE68655A16895afd96F5534feD`	GAZA NOW	Ethereum
`19D1iGzDr7FyAdiy3ZZdxMd6ttHj1kj6WW`	BUY CASH MONEY AND MONEY TRANSFER CO	Bitcoin

The ZEDCEX cluster is the standout: five wallets on a single Tron-based exchange independently sanctioned by both the United States and Israel. GAZA NOW contributes two cross-confirmed Ethereum wallets. These are the highest-confidence sanctioned wallets in the entire dataset — not because any individual list is more authoritative, but because two independent government entity resolution processes landed on the same on-chain identities.

You cannot find this with OFAC alone. You cannot find it with Israel's list alone. You find it only when you reconcile them.

2. The largest clusters are universal infrastructure

At multi-chain scale, the top multi-source clusters by member count are all deterministic-deployment contracts:

Size	Address	What it is
45	`0x7cbb62eaa69f79e6873cd1ecb2392971036cfaa4`	Safe: Create Call 1.3.0
43	`0x40a2accbd92bca938b02010e17a5b8929b49130d`	Safe: Multi Send Call Only 1.3.0
42	`0xa6b71e26c5e0845f74c812102ca7114b6a896ab2`	Safe: Proxy Factory 1.3.0
40	`0x3e5c63644e683549055b9be8653de26e0b4cd36e`	Safe: Singleton L2 1.3.0
39	`0xd9db270c1b5e3bd161e8c8503c55ceabee709552`	Safe: Singleton 1.3.0
37	`0xf48f2b2d2a534e402487b3ee7c18c33aec0fe5e4`	Safe: Compatibility Fallback Handler 1.3.0
30	`0x000000000022d473030f116ddee9f6b43ac78ba3`	Uniswap Permit2
27	`0x66a71dcef29a0ffbdbe3c6a460a3b5bc225cd675`	LayerZero Ethereum Endpoint

Safe (formerly Gnosis Safe) uses CREATE2 with chain-independent salts, which means the same singleton contract ends up at the same address on every EVM chain it's deployed to. So do Permit2 and LayerZero. The cluster size is literally a count of "how many chains is this deployed on?" — 45 chains for Create Call 1.3.0, 30 for Permit2.

That's a real finding about the structure of the modern EVM ecosystem. Entity resolution on multi-chain contract deployment data automatically surfaces the universal-infrastructure layer without anyone asking it to. If you're building an allowlist of "standard reusable contracts that are safe-by-reputation across every chain," this cluster table is a reasonable starting point. I did not go in looking for this — it just fell out of the data.

3. Attackers verify source code at the long-tail baseline rate

I also pulled Forta's labelled-datasets, which include 719 known-malicious Ethereum smart contracts and 569 phishing-scam contracts. The honest question: do attackers publish verified source code on Sourcify?

Population	Size	Verified on Sourcify
Forta malicious contracts	719	3 (0.4%)
Forta phishing contracts	569	3 (0.5%)
ScamSniffer addresses	2,530	0 (0%)

This is where I have to resist the obvious headline. "Malicious contracts almost never verify source code" is technically true but misleading: Sourcify holds ~324k verified Ethereum mainnet contracts against an estimated ~70M+ total contracts ever deployed, which puts the baseline verification rate around 0.5%. Malicious contracts at 0.4% are statistically indistinguishable from that baseline.

The defensible framing is this: malicious contracts behave like the long tail of random/abandoned/spam contracts, not like production contracts. Mainstream DeFi protocols verify at rates well above 50%. Attackers don't. For the purposes of attribution, "is this a Sourcify-verified contract?" on its own is a weak filter — but "verified Sourcify contract AND has Etherscan tags AND appears in eth-labels AND DeFiLlama" is an extremely strong legitimacy signal. The 301 quadruple-confirmed clusters at 13M scale are the set of contracts that every independent attribution observer agrees exist and matter.

The 3 verified malicious contracts are outliers worth manual investigation: two are Fake_Phishing tagged contracts that nonetheless published source (presumably to look legitimate to a casual reviewer), and one is a suspicious "TrueEUR" token.

What else is in the data

A few secondary findings that didn't need their own section:

Deployer-reuse patterns in the malicious-contracts dataset. The Forta dataset records contract_creator for each malicious contract. Grouping by creator surfaces a heavily skewed distribution: one address deployed 15 Fake_Phishing contracts, another deployed 11, and the original bZx Exploiter 1 wallet deployed 9 distinct exploit contracts — all tied back to the 2020 bZx flash-loan attack. Twelve deployer addresses are responsible for roughly 15% of the entire labeled malicious-contract corpus. Watching deployers is dramatically more efficient than watching deployments, and the clustering falls out of ER trivially.

OFAC's internal duplicates. The SUEX OTC wallets appear in OFAC's own list twice — once under the XBT:CYBER2 program and once under USDT:CYBER2, because Treasury sanctioned the same Bitcoin address for Bitcoin activity and the Tron-USDT it bridged through. Without ER you'd treat them as two distinct records; with ER the internal duplication is obvious.

Cross-source pattern distribution. At 13M the dominant multi-source pattern is etherscan-labels + sourcify (12,472 clusters) — verified contracts that are also tagged. Then eth-labels + forta (5,146) — curated DeFi labels overlapping with malicious flags. Then the triple-confirmed eth-labels + etherscan-labels + sourcify (5,034). The full distribution is in output_15m/report.json in the companion repo.

Honest limitations

I want to be precise about what this analysis is and isn't:

It is not criminal wallet discovery. Every "sanctioned" label comes from a government source. ER reconciles those labels across sources. It does not identify new bad actors. Nothing in this post claims to.
It is not a substitute for on-chain forensics. Chainalysis-style graph tracing answers a completely different question (follow the flows). This pipeline answers "whose opinions do we have about this address, and do they agree?"
Ground truth is bounded. When two sanctions lists agree, you have two-jurisdiction confirmation, which is strong. When Forta and eth-labels agree on a malicious tag, you have two independent community labels, which is weaker. Nothing here is a court case.
The Sourcify baseline assumes the universe of all Ethereum contracts. If you normalize against "contracts anyone cares about" instead of "all contracts ever deployed," the verification-rate story changes. I chose the inclusive denominator on purpose — it's what Sourcify's own data supports.
Names are dirty. I found five different OFAC entries for the same Bitcoin address with slightly different entity spellings; two different Etherscan tags for the same Lazarus wallet; and an Israeli-sourced wallet whose "entity_name" field was just the address itself. ER is only as clean as the input. GoldenCheck auto-fixed 51 text-level issues before matching, but it didn't — and shouldn't — normalize semantic disagreement.

Takeaways

Ten public blockchain attribution datasets, 13.1M records, 30,958 multi-source clusters. The free public attribution universe is larger than it looks if you combine it, and trivially reconcilable if you normalize the address.
Names disagree, addresses don't. The Lazarus Group / Ronin Bridge Exploiter case is the best two-word argument for entity resolution on blockchain data I've seen.
Cross-jurisdictional sanctioning is real and detectable. Nine wallets — including a cluster of five ZEDCEX addresses — are sanctioned by both the US and Israel. You only see this if you reconcile multiple sanctions sources.
ER on multi-chain contract data surfaces universal infrastructure for free. The top clusters are Safe, Permit2, and LayerZero — deployed at the same CREATE2 address across 30-45 chains each. Cluster size is the chain count.
Attackers verify source code at the long-tail baseline rate, not the production rate. The useful signal is not "verified" but "verified and independently tagged by multiple labelers."

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-wallet-attribution. The full flow is four commands:

python fetch_public_data.py    # ~2.5 GB download, ~10 min
python extract_ofac.py         # parse SDN_ENHANCED.xml
python run_15m.py              # stage 10 sources to common schema
python analyze_15m.py          # cross-source cluster analysis

All ten data sources are permissively licensed and redistributable. No API keys. No auth. The ~13M-row analysis finishes in about 3 minutes of wall-clock time on a laptop once the data is local.

If you want to see what GoldenMatch looks like with its full feature set — fuzzy scoring, blocking strategies, lineage tracking, golden records — the earlier archive/run_clusters.py runs it on the 535k-row subset end-to-end. That's the run that surfaced the Lazarus / Ronin Bridge case. Both scripts are preserved because they're answering different questions: does the ER pipeline work? (yes, 535k with GoldenMatch) and what falls out at the public-data ceiling? (the 13M run above).

Install GoldenMatch: pip install goldenmatch. Star the repo: benzsevern/goldenmatch. Try the live playground: bensevern.dev/playground.

Reproducibility footer.

Source datasets: OFAC SDN Enhanced XML (US Treasury), Sourcify parquet exports (export.sourcify.dev, manifest timestamp 2026-01-05T16:48:52Z), brianleect/etherscan-labels (main), dawsbot/eth-labels (master), MyEtherWallet/ethereum-lists (master), forta-network/labelled-datasets (main), scamsniffer/scam-database (main), api.llama.fi/protocols, OpenSanctions us_fbi_lazarus_crypto and il_mod_crypto (latest).
Total download: ~2.5 GB.
Input rows: 13,147,920 across 10 sources.
Unique addresses: 12,588,179.
Multi-source clusters: 30,958. Quadruple-confirmed: 301. Cross-sanctioned (US + Israel): 9.
Tools: goldenmatch 1.4.4, polars 1.39, Python 3.12.
Hardware: Windows laptop, 32 GB RAM.
Code and raw outputs: benzsevern/goldenmatch-wallet-attribution (MIT). Scripts: fetch_public_data.py, extract_ofac.py, run_15m.py, analyze_15m.py, analyze_malicious.py. Headline JSON: output_15m/report.json. Cross-sanctioned records: output_15m/cross_sanctioned.json.
Data date: 2026-04-09.

Originally published at https://bensevern.dev

Hot take that the benchmark backs up: traditional OSS entity resolution trusts you, the user, to know what you're doing. On 50,000 rows of real healthcare data on a laptop, that trust is misplaced. Full writeup, real numbers, honest disclaimers

benzsevern — Wed, 08 Apr 2026 17:21:57 +0000

benzsevern

Apr 8

The OSS ER Bargain: What Entity Resolution Actually Costs You

#python #datascience #opensource #benchmarking

Comments

9 min read

The OSS ER Bargain: What Entity Resolution Actually Costs You

benzsevern — Wed, 08 Apr 2026 17:19:56 +0000

The OSS ER Bargain: What Entity Resolution Actually Costs You

Benchmarking dedupe vs GoldenMatch on 500,000 CMS provider records

The National Plan and Provider Enumeration System (NPPES) publishes one of the largest open healthcare directories in the world: 6+ million U.S. providers, updated monthly, with names spelled four different ways, addresses that drift across quarters, and enough Smiths and Garcias to keep any blocking algorithm honest. It's a reasonable stand-in for the kind of data most organizations actually have: real, messy, and big enough to hurt.

I wanted to see what it costs to resolve a dataset like this with traditional open-source entity resolution, versus a holistic approach. So I took 500,000 randomly-sampled records from the March 2026 NPPES release and pointed two tools at them: dedupe, the canonical Python OSS deduper, and GoldenMatch, the matching engine at the heart of the Golden Suite.

This isn't a precision/recall bake-off. NPPES ships no ground-truth duplicate labels, and I refused to inject synthetic ones — faking the test data to prove a point is cheating. What I measured instead is what it actually feels like to use each tool: wall-clock runtime, peak memory, how many decisions you have to make, and — critically — whether the tool can even finish the job.

The OSS bargain

dedupe is, in many ways, the textbook open-source entity resolution library. It's well-documented, actively maintained, used in production at real companies, and its active-learning approach is genuinely clever: rather than make you write deterministic rules, it surfaces pairs of records it's uncertain about and asks you to label them.

That cleverness has a cost, and the cost is you.

Setting up dedupe on NPPES means answering a sequence of questions the tool can't answer itself:

Which fields do you want to match on? Pick wrong and your recall tanks.
What types are they — String, Exact, ShortString, Price, LatLong? Each has different behavior and you need to know which.
How should it sample training pairs? What sample_size? What blocked_proportion? These numbers shape what dedupe even sees.
Is your labeler honest? Without ground truth, you're either clicking through uncertain pairs yourself, or — as I did here — writing a deterministic rule that labels pairs programmatically. Either way, you own the decision.
What threshold do you partition at? 0.5? 0.3? 0.7? The number is yours. dedupe will not tell you which one is right for your data.
index_predicates=True or False? In dedupe 3.x, the "True" path needs an extra explicit indexing step or it crashes with NoIndexError mid-partition. I found this out the hard way.

None of these questions have wrong answers in isolation. What they have in common is that every one of them is a decision the user has to make, and every one of them silently changes the output of the algorithm downstream. dedupe trusts you to know what you're doing. When you don't, you get quiet failure.

The holistic alternative

GoldenMatch takes a different approach. You still write a config — I'm not going to pretend it's zero-configuration — but the config describes what your data is, not how dedupe should learn to resolve it. The blocking strategy, the scorers, the weight vectors, the clustering step, and the schema inference are all owned by the library. You point it at your polars DataFrame and call dedupe_df.

Here's the whole GoldenMatch setup I used for NPPES:

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["last_name"], transforms=["soundex"]),
            BlockingKeyConfig(fields=["zip"], transforms=[]),
            BlockingKeyConfig(fields=["org_name"], transforms=["substring:0:3"]),
        ],
        max_block_size=500,
        skip_oversized=True,
    ),
    matchkeys=[
        MatchkeyConfig(
            name="provider", type="weighted", threshold=0.75,
            fields=[
                MatchkeyField(field="first_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
                MatchkeyField(field="last_name", scorer="jaro_winkler", weight=2.0, transforms=["lowercase", "strip"]),
                MatchkeyField(field="org_name", scorer="token_sort", weight=1.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="address", scorer="token_sort", weight=1.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="city", scorer="jaro_winkler", weight=0.5, transforms=["lowercase", "strip"]),
                MatchkeyField(field="zip", scorer="exact", weight=1.0),
            ],
        ),
    ],
)

result = goldenmatch.dedupe_df(df, config=config)

That's the whole thing. Three blocking passes (phonetic surname, exact zip, organization prefix), six weighted field scorers, one threshold. No training loop. No uncertain-pair labeling. No "did I pick the right number of training pairs" anxiety.

What happened at 50,000 rows

I ran both tools on a 50,000-row slice of the NPPES sample:

Metric	`dedupe`	GoldenMatch	Ratio
Wall-clock runtime	3,589 s (59.8 min)	17.3 s	207×
Peak process RSS	8,699 MB	602 MB	14×
Multi-record clusters found	0	2,857	—
Config lines	206	148	1.4×
Human decisions required	8+ (see list above)	3 (blocking, scorers, threshold)	—

The runtime and memory numbers are jaw-dropping on their own. But look at the "multi-record clusters found" row. dedupe returned zero clusters with more than one record. It produced 50,000 singletons — a perfectly unhelpful partition that says every record is its own entity.

This is not because NPPES has no duplicates. GoldenMatch found 2,857 multi-record clusters on the same data: real matches like PETER ROBERT NEHREBECKI at 240 SHOTWELL ST STE 206 appearing twice under different NPIs, or organizational providers sharing an address and a taxonomy code. The duplicates are there. dedupe just couldn't see them.

Why not? Because dedupe's classifier needs balanced positive and negative training pairs, and the deterministic rule oracle I fed it (match iff same NPI, or same normalized last_name + first_name + zip5) rarely triggers in a random 50k slice of NPPES. Without enough positives, the classifier collapses to "everything is distinct," sklearn warns "only one class in y," and you wait an hour for an output that says nothing.

Could I fix this? Yes. I could loosen the rule oracle, or pre-seed with softer matches, or hand-label pairs, or try a different classifier. All of those are more decisions I'd have to make — decisions that dedupe's design says are mine to own. I ran it honestly, with a clearly-documented protocol, and honestly is what I got.

Scaling out: does GoldenMatch survive 500,000?

Having established that dedupe is not going to finish NPPES at any interesting scale on a laptop, I ran GoldenMatch up the ladder.

Tier	GoldenMatch runtime	Peak RSS	Multi-record clusters	Records collapsed
50,000	17.3 s	602 MB	2,857	2,857
100,000	47.0 s	731 MB	~9,511	9,511
500,000	261.0 s	2,150 MB	~120,191	120,191

Ten times the data, fifteen times the runtime, four times the memory, and roughly forty times the duplicates found. Sub-linear scaling on cluster count — unsurprising, since large datasets surface more duplicate pairs per row. The 500k run finished in 4 minutes 21 seconds using 2.1 GB of RAM on a Windows laptop. Whatever dedupe was doing with its 8.7 GB and its hour of CPU at 50k, GoldenMatch was doing 10× the work in a quarter of the time and a quarter of the memory.

What the sensitivity analysis actually shows

I also swept GoldenMatch through 5 config variations at 50k — four threshold values (0.65, 0.70, 0.80, 0.85) plus a stricter weight preset — and measured Adjusted Rand Index against the default run:

Variant	ARI vs default
`threshold=0.65`	0.5044
`threshold=0.70`	0.7299
`threshold=0.80`	0.4716
`threshold=0.85`	0.2821
`preset_strict`	0.8505

Here's what I want to flag honestly: GoldenMatch's output is sensitive to threshold. The ARI range across variants is 0.57 — that's a lot of movement. If your only claim was "holistic ER is stable under config changes," this table would undermine you.

I don't think that's the right claim.

The right claim is: the knobs work. When you tighten the threshold from 0.65 to 0.85, GoldenMatch produces noticeably stricter clusters — exactly as you'd expect. The threshold is a real, functional control surface, not a cosmetic dial. A sensitivity of 0.57 ARI means the tool actually does different things when you ask it to.

And — here's the uncomfortable counterpart — I cannot compare this to dedupe's sensitivity, because dedupe at 50k produces all-singletons at every threshold. Dedupe's "sensitivity" is 0.0 because the output is trivially constant: nothing, nothing, nothing, nothing. Perfect stability, zero utility.

That's the shape of the real comparison. One tool has knobs that work on a job it can actually finish. The other tool's knobs don't matter because it never got to a meaningful output in the first place.

What "holistic" actually means

When I say GoldenMatch's approach is holistic, I do not mean "it hides the hard decisions from you." Clearly it doesn't — the threshold matters, the blocking choices matter, the scorer weights matter. You can see every one of them in the config block above.

What I mean is that GoldenMatch owns the decisions the user shouldn't have to own:

Whether to build an index over blocking predicates, and when to release it. dedupe makes this your problem and crashes if you guess wrong.
Whether to fall back to a lookup table when a block grows oversized. dedupe blows your memory budget before you notice.
How to assemble per-field scores into a cluster decision, and how to verify that decision across the transitive closure of pairs. dedupe leaves this to a classifier whose training data you have to provide.
How to handle the case where your labeled training set has no positives. dedupe collapses silently. GoldenMatch doesn't need labels.

The OSS bargain is: the library gives you flexibility, and the cost is that you own the consequences of every degree of freedom it exposes. That's fine for small datasets, clean schemas, and practitioners who already know what they're doing. On 500,000 rows of real NPPES data on a laptop, it's not a bargain — it's a trap.

The disclaimers

I want to be precise about what this benchmark is and isn't:

No ground truth. NPPES doesn't ship duplicate labels, and I didn't inject synthetic ones. Every "duplicates found" number is what each tool reports, not what is objectively correct. Some of GoldenMatch's 2,857 clusters at 50k are probably wrong. Without ground truth, I can't tell you the precision or recall of either tool. What I can tell you is that 0 is not the right answer.
Dedupe's labeling protocol matters a lot. I used a deterministic rule (NPI equality OR normalized last_name + first_name + zip5 equality) to label pairs for dedupe. A different protocol — a hand-labeled training set, or a looser rule — would likely give dedupe a fighting chance to learn a real classifier. My protocol is strict on purpose: it's the kind of thing a data engineer would actually write when they need a reproducible pipeline without human-in-the-loop labeling. If your protocol is softer, your results will differ.
Memory numbers include the Python interpreter and loaded libraries. Peak RSS is measured via psutil.Process().memory_info().rss sampled every 500ms in a background thread. Both tools share the same baseline, so the comparison is fair, but don't read "8,699 MB" as "what dedupe's data structures allocated" — read it as "what the process was holding at its peak."
GoldenMatch benefits from recent memory-management work. The Golden Suite has had explicit OOM-prevention work over the last several months. Dedupe doesn't. That asymmetry is real, and I'm not pretending it isn't. If you ran this on dedupe's preferred architecture (e.g., with Postgres-backed storage via dedupe-examples), the memory number would improve — at the cost of adding Postgres to your workflow, which is yet another decision you'd have to make.
dedupe is an excellent tool in its lane. I'm not here to bury it. On small, labeled datasets with an engaged human, it does exactly what it says on the tin. The point of this post is that "small, labeled, with an engaged human" is a much narrower lane than it looks, and lots of real-world ER problems fall outside it.

Closing

If you take nothing else from this post, take this: the cost of an entity resolution tool is not the license fee, it's the number of decisions the tool hands back to you.

dedupe hands you the field types, the blocking predicates, the sample size, the training labels, the classifier choice, the index strategy, the threshold, and the prayer that it all adds up to something useful. At 50,000 rows of NPPES on my laptop, it did not.

GoldenMatch hands you a config, runs, and tells you the answer. The answer is opinionated — the threshold matters, the weights matter — but the tool finishes the job, and the job at scale is the job that actually matters.

Your mileage will vary. Your data is not NPPES. Your hardware is not my laptop. Your labeling protocol is not my labeling protocol. But the next time you're evaluating an ER tool, don't just ask "what accuracy does it reach?" — ask "on my data, at my scale, with the time I have, does it finish?"

For NPPES on a laptop, the answer to that question is already decided.

Reproducibility footer.

Source data: NPPES Full Replacement Monthly NPI File, March 2026 (V2) release.
URL: https://download.cms.gov/nppes/NPPES_Data_Dissemination_March_2026_V2.zip
Downloaded: 2026-04-08T15:01:58Z
Zip SHA-256: 34ba67637c69bc72dfe48f28625d3988550c679fdbc95786af543228912cb463
Sample: 500,000 rows via streaming reservoir sample (seed=42), columns pinned to npi, entity_type, org_name, last_name, first_name, middle_name, address, city, state, zip, taxonomy.
Tools: dedupe (3.x), goldenmatch 1.4.3, Python 3.12.
Hardware: Windows laptop, 32 GB RAM.
Code: comparison_bench/ in the golden-showcase repo. Scripts: data_prep.py, run_dedupe_nppes.py, run_goldenmatch_nppes.py, feasibility_probe_nppes.py, bench_utils.py.
Raw results: results_dedupe_nppes.json, results_goldenmatch_nppes.json, results_feasibility_nppes.json, plus per-run cluster sidecars in comparison_bench/clusters/.

bensevern.dev

benzsevern / goldenmatch

Entity resolution and deduplication toolkit — outperforms Splink, dedupe, and RecordLinkage on cross-domain benchmarks. Zero-config. MST cluster auto-splitting. Quality-weighted survivorship. 30 MCP tools on Smithery. 10 A2A skills. 97.2% F1 on DBLP-ACM.

GoldenMatch

Find duplicate records in 30 seconds. No rules to write, no models to train.

pip install goldenmatch
goldenmatch dedupe customers.csv

Why GoldenMatch?

Zero-config — auto-detects columns, picks scorers, and runs. No training data needed
97.2% F1 on DBLP-ACM out of the box. DQBench ER score: 95.30
Privacy-preserving — match across organizations without sharing raw data (PPRL, 92.4% F1)
30 MCP tools — use from Claude Desktop, Claude Code, or any AI assistant (Smithery)
Production-ready — Postgres sync, daemon mode, lineage tracking, review queues

Choose your path

I want to...	Go here
Deduplicate a CSV right now	Quick Start
Use from Claude Desktop / AI assistant	MCP Server
Build AI agents that deduplicate	ER Agent (A2A)
Write Python code	Python API
Use the interactive TUI	TUI Guide

All features (click to expand)

Matching

10+ scoring methods — exact, Jaro-Winkler, Levenshtein, token sort, soundex, ensemble, embedding, record embedding, dice…

View on GitHub

MCPs enabling data cleaning and deduping.

benzsevern — Tue, 07 Apr 2026 17:27:49 +0000

benzsevern

Apr 7

Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit

#python #opensource #mcp #aiagents

Comments

5 min read

Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit

benzsevern — Tue, 07 Apr 2026 17:02:45 +0000

An AI agent can write SQL, draft an email, and refactor a repo. Ask it to deduplicate a 50,000-row customer file and it will cheerfully hand you a pandas.drop_duplicates() one-liner that finds zero matches. The model knows the concept. It does not know your data, and it has no tool that actually solves entity resolution.

The Model Context Protocol (MCP) is the missing wire. It lets a host like Claude Code, Cursor, or any agent runtime call real tools running on your machine — with real schemas, real parameters, and real results. Golden Suite was built as a set of composable Python packages from day one, which makes it a near-perfect fit. This post walks through how we expose Golden Suite over MCP, what that unlocks for AI workflows, and where the roadmap goes from here.

What MCP actually is

MCP is a thin JSON-RPC protocol that standardises three things between an AI host and an external server:

Tools — typed functions the model can call (goldenmatch.dedupe, infermap.map_schema)
Resources — readable artifacts the model can pull into context (a sample of a CSV, a profiling report)
Prompts — pre-baked instruction templates the host can offer the user

The host handles the LLM. The server handles the work. The contract between them is a stable schema, which means the same Golden Suite MCP server works in Claude Desktop, Cursor, Continue, or a custom Agent SDK app — without rewriting any glue code.

Why Golden Suite fits

Each Golden Suite package is already a small, well-typed Python API:

Package	What it does	Natural MCP tool
infermap	Schema mapping between source and target	`map_schema(source, target)`
GoldenCheck	Profiling and data quality scanning	`profile(path)`, `quality_report(path)`
GoldenFlow	Auto-transformation of messy values	`clean(path, rules?)`
GoldenMatch	Entity resolution and deduplication	`dedupe(path, config?)`
GoldenPipe	Orchestrates the full pipeline	`run_pipeline(path)`

Wrapping these as MCP tools is mostly metadata — the underlying functions already accept paths, return structured results, and stream progress. A minimal server looks like this:

from mcp.server.fastmcp import FastMCP
from goldenmatch import dedupe
from infermap import map_schema
from goldenpipe import run_pipeline

mcp = FastMCP("golden-suite")

@mcp.tool()
def goldenmatch_dedupe(path: str, threshold: float = 0.85) -> dict:
    """Deduplicate a CSV using fuzzy entity resolution."""
    result = dedupe(path, threshold=threshold)
    return {
        "input_rows": result.input_rows,
        "clusters": result.clusters,
        "golden_records": result.golden_records,
        "match_rate": result.match_rate,
    }

@mcp.tool()
def infermap_map(source_csv: str, target_schema: str) -> dict:
    """Map columns from a source CSV to a target schema."""
    return map_schema(source_csv, target_schema).to_dict()

@mcp.tool()
def goldenpipe_run(path: str) -> dict:
    """Run profile → clean → dedupe in one shot."""
    return run_pipeline(path).summary()

if __name__ == "__main__":
    mcp.run()

Drop that into a Claude Desktop config and the model now has hands.

What this actually unlocks

The interesting part is not "Claude can call dedupe." It is what happens when a planning model can chain these tools against real files in a single conversation.

1. Conversational data cleaning

A user drags a CSV into Claude and says "make this usable." The agent calls goldencheck_profile, sees 18% missing zip codes and three different date formats, calls goldenflow_clean, then goldenmatch_dedupe, and reports back: "5,426 rows in, 4,891 golden records out, 535 fuzzy duplicates merged. Here are 12 clusters that look low-confidence — want to review them?" No code written, no docs read.

2. Schema mapping inside an ETL agent

Today, mapping a vendor's cust_id to your customer_id is a human-in-the-loop chore. With infermap exposed over MCP, an agent building an ingestion pipeline can call infermap_map(source, target), get a confidence-scored mapping, and only ask the human about the columns it isn't sure about. The boring 80% disappears.

3. Reverse ETL where the AI is the ETL

Once the agent can both map and match, it can take an arbitrary file and merge it into an existing identity store without a pre-written job. That is the underlying bet behind Golden Suite — an autonomous identity layer — and MCP is the surface that lets the agent reach it.

4. Honest accuracy reporting

Because the tools return structured results (cluster counts, match rates, confidence histograms), the model can quote real numbers instead of inventing them. When an agent says "I deduplicated this," you can verify the claim against the tool output. That is a much better story than "trust the LLM."

What stays hard

MCP does not solve everything. A few things still need care:

Resource limits. Running a 401K-row dedup inside an interactive chat session is a great way to OOM your laptop. The server has to enforce row limits or stream to a background job.
Auth and scoping. An agent with a dedupe(path) tool can read any file the server can read. Path allowlists matter.
Determinism. LLM-boost paths use embeddings and an LLM tiebreaker — runs need to be reproducible enough that "the agent did it" is auditable.
Cost visibility. When the agent triggers a paid LLM-boost step, the user should see it before it happens, not after.

None of these are MCP problems specifically — they are the same problems any agent-callable tool has — but they shape how the server gets built.

Future direction

The MCP server is the front door. The interesting roadmap is what sits behind it.

Phase 2 — Identity Store as a resource. Once the persistent identity store lands, MCP exposes it as a resource the agent can read from and write to. An agent ingesting a new file does not just dedupe within the file — it merges into the canonical store and gets back stable IDs.

Conversational correction. The paid Golden Suite features are built around correcting the model's mistakes in natural language ("merge these two clusters", "split this one"). MCP makes this a first-class loop: the agent surfaces low-confidence clusters as a prompt, the user corrects them in chat, and the corrections feed back into the matcher's learned config.

Ingestion connectors as MCP tools. The Phase 2 ingestion layer (warehouses, databases, SaaS APIs) becomes a family of MCP tools — snowflake_pull, salesforce_pull, postgres_pull — that hand data straight into the existing pipeline tools. The agent can then say "pull yesterday's leads from Salesforce, dedupe against the identity store, and push the merges back." End-to-end, with no glue code.

Multi-agent pipelines. Once each step is an MCP tool, you can run a planner agent that decomposes a high-level goal ("clean and merge all of Q1's vendor files") into parallel sub-agents, each calling the same Golden Suite server. The server becomes the shared substrate; the agents become disposable.

Public hosted MCP endpoint. Long-term, a hosted Golden Suite MCP server means you don't have to install anything — point your agent host at a URL, authenticate, and you have a data cleaning toolkit. That is the Golden Suite product surface in one sentence.

Key takeaways

MCP is the standard wire between AI hosts and real tools — it removes the per-host glue code.
Golden Suite's package boundaries map almost one-to-one onto MCP tools.
The unlock is not "Claude can call dedupe" — it is conversational, end-to-end data cleaning where the agent chains profile → clean → match → merge against real data.
The roadmap points at the identity store, conversational correction, ingestion connectors, and a hosted endpoint — each one extends what an agent can do without leaving the chat.

Try it

Golden Suite is on PyPI today — pip install goldenmatch, pip install infermap, pip install goldenpipe. The MCP server wrapper is the thinnest layer on top, and if you want to point Claude Desktop or Cursor at a local Golden Suite install, the snippet above is a working starting point. Star the repo, try the live tools in the playground, and if you want an MCP-first workflow shipped sooner rather than later — let me know what you would call first.

Originally published at https://bensevern.dev

From Dirty CSV to Golden Records: A Python Walkthrough

benzsevern — Tue, 07 Apr 2026 17:02:43 +0000

Download a government CSV, load it into pandas, and you'll find "MEMORIAL HOSPITAL" listed twelve times across six states. Run drop_duplicates() — it finds zero exact copies. Try deduplicating on facility name alone — it merges hospitals that are genuinely different. Data cleaning and deduplication in Python requires more than one-liners. It requires a coordinated pipeline that profiles, cleans, and matches records in sequence.

The Dataset
Why drop_duplicates() Fails on Real Data
Zero-Config Data Cleaning in One Line
Part 1: Explicit Config & Domain Knowledge
Part 2: LLM Boost — When String Matching Isn't Enough
Key Takeaways

The Dataset

The CMS Hospital General Information file is a public dataset from data.cms.gov listing every Medicare-certified hospital in the United States. We downloaded the April 2026 snapshot.


df = pl.read_csv("hospitals.csv")
print(df.shape)
# (5426, 38)

5,426 rows. 38 columns. The key fields: facility_name, address, citytown, state, zip_code, telephone_number, hospital_type, hospital_ownership.

Here's a sample of what the raw data looks like:

facility_name	address	citytown	state	telephone_number
MEMORIAL HOSPITAL	3801 SPRING AVE	DECATUR	IL	(217) 876-8121
MEMORIAL HOSPITAL	4500 MEMORIAL DR	BELLEVILLE	IL	(618) 233-7750
MEMORIAL HOSPITAL	116 EAST 12TH STREET	JASPER	IN	(812) 996-2345
ST LUKES MEDICAL CENTER	1800 E VAN BUREN ST	PHOENIX	AZ	(602) 251-8100
FLORIDA STATE HOSPITAL UNIT 14 PSYCH	PO BOX 1000	CHATTAHOOCHEE	FL	(850) 663-7536
FLORIDA STATE HOSPITAL UNIT 31 MED	PO BOX 1000	CHATTAHOOCHEE	FL	(850) 663-7536

Phone numbers use (xxx) xxx-xxxx formatting. Some addresses abbreviate "STREET" as "ST" while others spell it out. The same hospital name appears across multiple states. And in a few cases, the same physical facility shows up as two rows with different unit designations.

Why `drop_duplicates()` Fails on Real Data

The instinct is to reach for pandas drop_duplicates(). Let's try it three ways.

Attempt 1: All columns.


df = pd.read_csv("hospitals.csv")
dupes = df.duplicated().sum()
print(dupes)
# 0

Zero exact duplicates. Every row differs on at least one column — different phone format, different whitespace, different unit number. Real-world data almost never has perfect row-level copies.

Attempt 2: Facility name only.

dupes = df.duplicated(subset=["facility_name"]).sum()
print(dupes)
# 131

131 rows flagged. But this is wrong in the other direction — 87 hospital names appear more than once because they're genuinely different hospitals in different states. "MEMORIAL HOSPITAL" in Decatur, IL is not the same facility as "MEMORIAL HOSPITAL" in Jasper, IN. Deduplicating on name alone merges records that should stay separate.

Attempt 3: Manual fuzzy matching.

from fuzzywuzzy import fuzz

# Compare every pair? 5,426 * 5,425 / 2 = 14.7 million comparisons
# Even at 10,000 comparisons/sec, that's 24 minutes
# And you still need to decide: what threshold? which columns? how to merge?

You could write a custom fuzzy matcher — lowercase everything, strip whitespace, compute Levenshtein ratios. But you'd need to handle blocking (which records to compare), scoring (how to weight name vs address vs phone), and merging (how to pick the canonical record). That's hundreds of lines of brittle code for one dataset.

The core problem: naive approaches either miss real duplicates or merge records that shouldn't be merged. You need profiling, cleaning, and matching as a coordinated pipeline.

Zero-Config Data Cleaning in One Line

GoldenPipe runs the full scan-clean-deduplicate pipeline in a single call. If you're new to GoldenPipe, the getting started guide covers installation and core concepts.


result = gp.run("hospitals.csv")

print(result.status)     # "completed"
print(result.timing)     # {total: 3.1, check: 0.4, flow: 0.4, match: 2.0}

Or from the command line:

goldenpipe run hospitals.csv

Click Run to process a sample of the hospital data through the full pipeline. The playground sample includes 5,000 rows with the 11 key columns — the numbers below were generated from the full 38-column dataset.


result = gp.run("hospitals.csv")

print(result.status)
print(result.timing)

3.1 seconds total. That one call ran scan, clean, and deduplicate across all 5,426 rows. Let's look at each stage.

Stage 1: GoldenCheck — Scan

GoldenCheck profiled all 38 columns and reported 155 quality findings in 0.4 seconds.

Click to see the breakdown of the 155 findings
| Finding Type | Count | What It Caught |
|---|---|---|
| pattern_consistency | 53 | Phone formats, address abbreviation patterns |
| nullability | 38 | Columns with significant missing values |
| cardinality | 30 | Low-cardinality columns like hospital_type (8 values) |
| range_distribution | 15 | Numeric outliers in zip codes and CMS ratings |
| type_inference | 10 | Phone/zip stored as strings but parseable as other types |
| drift_detection | 3 | Distribution shifts across data segments |
| null_correlation | 3 | Columns that are null together (correlated missingness) |
| format_detection | 2 | Mixed formatting within single columns |
| uniqueness | 1 | Near-unique columns like facility_id |

The pattern_consistency findings are the most actionable. GoldenCheck detected that all 5,426 phone numbers follow (xxx) xxx-xxxx formatting — consistent but not normalized. It flagged 82 addresses with mixed abbreviation patterns ("STREET" vs "ST", "AVENUE" vs "AVE") and 52 facility names with inconsistent casing or whitespace.

GoldenCheck doesn't fix anything — it hands findings to GoldenFlow.

Stage 2: GoldenFlow — Clean

GoldenFlow read GoldenCheck's 155 findings and applied targeted transforms. 5,832 cells changed in 0.4 seconds.

Column	Cells Changed	Before	After
telephone_number	5,426	(217) 876-8121	+12178768121
address	82	116 EAST 12TH STREET	116 E 12TH ST
facility_name	52	ST LUKES MEDICAL CENTER	ST LUKES MEDICAL CENTER
hospital_ownership	271	Government - Federal	GOVERNMENT - FEDERAL

Phone normalization: Every phone number converted from (xxx) xxx-xxxx to E.164 (+1xxxxxxxxxx). This isn't cosmetic — E.164 is the standard for downstream matching, API calls, and database storage.

Address standardization: 82 addresses had inconsistent abbreviations. GoldenFlow normalized "STREET" to "ST", "AVENUE" to "AVE", "BOULEVARD" to "BLVD" — the USPS standard forms.

Name cleanup: 52 facility names had trailing whitespace or double spaces. Invisible to the eye, fatal to exact matching.

Ownership normalization: 271 ownership values standardized to consistent casing. Small change, but it prevents false cardinality inflation downstream.

Zero config. GoldenFlow used GoldenCheck's findings to decide which transforms were safe to apply automatically.

Stage 3: GoldenMatch — Deduplicate (Zero-Config)

GoldenMatch ran entity resolution on the cleaned data. Here are the numbers:

Metric	Count
Input records	5,426
Golden records (cluster representatives)	479
Records flagged as duplicates	1,917
Unique (no matches)	3,509
Total distinct entities	3,988
Processing time	2.0s

479 clusters. 1,917 records flagged as duplicates. GoldenMatch's internal record count (5,905) differs from the input (5,426) because GoldenFlow's transforms can expand rows when splitting multi-value fields. The match rate is computed against the internal count.

What the clusters look like

Here are a few example clusters GoldenMatch produced:

Cluster	Records	facility_name	state
1	12	MEMORIAL HOSPITAL	IL, IN, PA, GA, CO, TX, ...
2	8	COMMUNITY HOSPITAL	OH, MO, IN, OK, ...
3	5	ST MARY'S HOSPITAL	MO, WI, MI, NJ, NY
4	4	REGIONAL MEDICAL CENTER	AL, MS, TN, SC

Why zero-config over-matched

479 clusters is too many for this dataset. The auto-config built blocking keys on facility name — the most obvious matching column. But hospital names are not unique identifiers. "MEMORIAL HOSPITAL" appears 12 times across different states. They are genuinely different hospitals.

Without geographic anchoring, GoldenMatch grouped every "MEMORIAL HOSPITAL" into one cluster, every "COMMUNITY HOSPITAL" into another. The auto-config had no way to know that hospitals with the same name in different states are different entities. It did exactly what it was designed to do — match records with similar names — but the domain requires geographic context.

This is the honest trade-off of zero-config: it's fast and catches obvious patterns, but it can over-match when names are common and geography matters. For hospital data specifically, you need to tell the matcher to only compare records within the same state.

Ground-truth caveat: The CMS dataset has no duplicate labels. These numbers measure how many records GoldenMatch grouped, not verified precision. The 479 clusters include both genuine duplicates and false positives from cross-state name matching. For production use, review borderline pairs with the review queue or goldenmatch evaluate.

Part 1: Explicit Config — Encoding Domain Knowledge

Zero-config over-matched because it lacked geographic context. Let's fix that with an explicit config that encodes what we know about hospital data.

Step 1: Blocking — Same State Only

The most important change. Instead of comparing all hospitals with similar names, restrict comparisons to hospitals in the same state.

from goldenmatch import GoldenMatchConfig, BlockingConfig, MatchKeyConfig

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            {"keys": ["state", "zip_code"]},          # Pass 1: same state + zip
            {"keys": ["state", "facility_name_3"]},    # Pass 2: same state + first 3 chars of name
        ]
    ),
)

Pass 1 catches hospitals at the same zip code — the tightest geographic net. Pass 2 catches hospitals in the same state with similar names — wider but still geographically anchored. This means "MEMORIAL HOSPITAL" in IL will never be compared to "MEMORIAL HOSPITAL" in IN.

Step 2: Scoring — Weighted Ensemble

Hospital names carry the most signal, but address and phone provide confirmation.

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            {"keys": ["state", "zip_code"]},
            {"keys": ["state", "facility_name_3"]},
        ]
    ),
    matchkeys=[
        MatchKeyConfig(column="facility_name", method="ensemble", weight=2.0),
        MatchKeyConfig(column="address", method="token_sort", weight=1.5),
        MatchKeyConfig(column="telephone_number", method="exact", weight=0.5),
        MatchKeyConfig(column="zip_code", method="exact", weight=0.3),
    ],
    threshold=0.80,
)

Why these weights? Facility name gets 2.0 because it's the primary identifier. Address gets 1.5 with token_sort because word order varies ("1800 E VAN BUREN ST" vs "1800 EAST VAN BUREN STREET"). Phone gets 0.5 as a confirmation signal — same phone strongly suggests same facility, but different phones don't rule it out (multi-line hospitals). Zip gets 0.3 as a tiebreaker.

Why 0.80 threshold? Hospital abbreviations ("ST" vs "SAINT", "MED CTR" vs "MEDICAL CENTER") drag fuzzy scores down. A threshold of 0.80 catches these while filtering noise.

Step 3: Run It


result = gp.run("hospitals.csv", match_config=config)
print(result.timing)
# {total: 3.0, check: 0.4, flow: 0.4, match: 2.2}

Results

Metric	Count
Input records	5,426
Clusters found	6
Records flagged as duplicates	12
Unique (no matches)	5,414
Total distinct entities	5,420
Processing time	2.2s

6 clusters. Down from 479. The state-based blocking eliminated all the cross-state false positives.

The 6 Genuine Clusters

Every cluster is a real same-state match:

Cluster	State	Records	What Matched
Crenshaw Community Hospital	AL	2	Same facility, minor address variation
Wiregrass Medical Center	AL	2	Same facility, data entry differences
Bullock County Hospital	AL	2	Same facility, different record versions
Florida State Hospital (Unit 14 Psych / Unit 31 Med)	FL	2	Same campus, different unit designations
Progressive Health Group of Houston	MS	2	Same facility, record variants
Carthage Area Hospital ("WEST STREET" vs "WEST ST")	NY	2	Same facility, address abbreviation

The Florida State Hospital cluster is particularly interesting — Unit 14 (Psych) and Unit 31 (Med) are different departments at the same physical campus with the same phone number and PO Box address. Whether these should be merged depends on your use case. For a facility-level analysis, yes. For a department-level analysis, no.

The Carthage Area Hospital cluster shows exactly the kind of match that drop_duplicates() misses: "WEST STREET" vs "WEST ST" — same address, different abbreviation.

Part 2: LLM Boost — When String Matching Isn't Enough

String matching measures visual similarity. LLMs understand meaning. A hospital rebrand from "County General" to "Mercy Health Partners" has zero string overlap but an LLM can reason about the context. For the theory and mechanics of LLM-assisted deduplication, see the LLM boost deep dive.

Here's the config with LLM scoring enabled:

from goldenmatch import LLMScorerConfig

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            {"keys": ["state", "zip_code"]},
            {"keys": ["state", "facility_name_3"]},
        ]
    ),
    matchkeys=[
        MatchKeyConfig(column="facility_name", method="ensemble", weight=2.0),
        MatchKeyConfig(column="address", method="token_sort", weight=1.5),
        MatchKeyConfig(column="telephone_number", method="exact", weight=0.5),
        MatchKeyConfig(column="zip_code", method="exact", weight=0.3),
    ],
    threshold=0.80,
    llm_scorer=LLMScorerConfig(
        candidate_lo=0.65,
        candidate_hi=0.80,
        calibration_sample_size=100,
        max_cost_usd=0.50,
    ),
)

result = gp.run("hospitals.csv", match_config=config)

The LLM scorer examines pairs that fall in the "uncertainty zone" — between 0.65 (too low to match) and 0.80 (already matched by fuzzy scoring). These are the borderline cases where string similarity alone can't decide.

Results: 0 Additional Pairs

The LLM scored zero additional pairs. Not because it failed — because there were no candidates in the uncertainty zone. Every pair was either above 0.80 (already matched) or below 0.65 (clearly not a match).

This is the honest story. For well-structured data with strong geographic blocking, explicit config is already so precise that the LLM has nothing to evaluate. The blocking passes constrain comparisons to same-state records, and within a state, hospital names either match clearly or don't match at all. There's no ambiguous middle ground.

When LLM Boost Does Help

LLM scoring shines on datasets where:

Names have semantic variation: "County General Hospital" vs "Mercy Health Partners" (rebrand)
Blocking is looser: Blocking on city alone produces more candidate pairs in the uncertainty zone
Abbreviation patterns are inconsistent: Some records use "MED CTR" while others use "MEDICAL CENTER" — fuzzy scores land around 0.70-0.78
Multilingual data: "Hospital Municipal" vs "City Hospital" — zero string overlap, same entity

On the CMS hospital data with state-based blocking, the explicit config already catches everything the LLM would. The $0.50 budget went unspent.

The Full Picture

Three approaches on the same 5,426 records:

	Zero-Config	Explicit Config	Explicit + LLM
Clusters found	479	6	6
Records merged	1,917	12	12
Distinct entities	3,988	5,420	5,420
Time	3.1s	3.0s	3.0s
Cost	$0	$0	$0
Config effort	None	~20 lines	~30 lines

Ground-truth caveat: None of these numbers are verified precision — the CMS data has no duplicate labels. The comparison shows relative improvement across approaches. The 479 zero-config clusters are demonstrably inflated (cross-state matching of common names), while the 6 explicit-config clusters pass manual inspection. For production use, verify matches with the review queue or goldenmatch evaluate.

The progression tells the real story:

Zero-config ran the full pipeline in 3.1 seconds with no effort. It caught real patterns (phone normalization, address standardization) but over-matched on deduplication because hospital names repeat across states.
Explicit config added 20 lines of domain knowledge — state-based blocking and weighted scoring — and dropped false positives from 479 clusters to 6. Same speed. Dramatically better results.
LLM boost found nothing additional on this dataset, which is the correct outcome. The explicit config was already precise enough. On messier data with semantic name variation, the LLM earns its keep.

Key Takeaways

drop_duplicates() barely scratches the surface. Zero exact duplicates in 5,426 real hospital records. The duplicates are there — they just don't look identical.
A coordinated pipeline beats three separate scripts. GoldenCheck's findings feed GoldenFlow's transforms, which feed GoldenMatch's scoring. Each stage builds on the last.
Zero-config gets you started in one line. 155 findings, 5,832 cells cleaned, deduplication complete — all in 3.1 seconds. Good enough for exploration and prototyping.
Zero-config can over-match. When names are common and geography matters, auto-blocking without domain context produces false positives. Always inspect the clusters.
Explicit config encodes domain knowledge. 20 lines of config — state-based blocking + weighted scoring — reduced false positives by 98%. The data tells you what the config should be.
LLM boost is for the long tail, not every dataset. Well-structured data with strong blocking may not need it. Save it for messy, semantic, or multilingual matching problems.

Try It Yourself

Try the GoldenPipe Playground

On your machine:

pip install goldenpipe
goldenpipe run hospitals.csv

Explore the source:

benzsevern / goldenpipe

Golden Suite orchestrator — chains validation (GoldenCheck), transformation (GoldenFlow), and entity resolution (GoldenMatch). 4 MCP tools on Smithery. DQBench Pipeline: 88.07.

GoldenPipe

Golden Suite orchestrator -- Check quality, fix issues, deduplicate records. One command.

What It Does

Raw Data
  | GoldenCheck   -- profile & discover quality issues
  | GoldenFlow    -- fix issues, standardize, reshape
  | GoldenMatch   -- deduplicate, match, create golden records
  v
Golden Records

GoldenPipe orchestrates the full pipeline with adaptive logic:

Skips transformation if no quality issues found
Routes to privacy-preserving matching if sensitive fields detected
Reports reasoning for every decision

Install

pip install goldenpipe

Quick Start

import goldenpipe as gp

result = gp.run("customers.csv")

print(result.status)        # "success"
print(result.check)         # Quality findings
print(result.transform)     # What was fixed
print(result.match)         # Deduplicated clusters
print(result.reasoning)     # Why each decision was made

CLI

goldenpipe run customers.csv                # Full pipeline
goldenpipe run customers.csv --verbose      #

…

View on GitHub

Originally published at https://bensevern.dev

401K messy equipment records, LLM-calibrated scoring, 12 seconds. Here's how.

benzsevern — Sat, 04 Apr 2026 17:49:55 +0000

benzsevern

Apr 4

Deduplicating 401,000 Equipment Auction Records with LLM Calibration

#python #ai #datascience #dataengineering

Comments

6 min read

The same 10 data issues show up in every dataset. Here are the one-liner fixes.

benzsevern — Sat, 04 Apr 2026 17:49:27 +0000

benzsevern

Apr 4

10 Data Problems Every Pipeline Hits (and the One-Liner Fixes)

#python #dataengineering #tutorial #datascience

Comments

4 min read

Benchmarked 4 Python dedup libraries on the same dataset. Results surprised me.

benzsevern — Sat, 04 Apr 2026 17:48:56 +0000

benzsevern

Apr 4

GoldenMatch vs. Splink vs. Dedupe vs. RecordLinkage: A Practical Comparison

#python #datascience #opensource #dataengineering

Comments

8 min read

DEV Community: benzsevern

Wagner Was on OFAC in 2018: What 10 Years of Sanctions Data Reveals

The three datasets

Finding 1: Wagner was on OFAC four years before the current narrative admits

Finding 2: 1 in 5 OFAC designations gets reversed

Finding 3: Sanctioned crypto wallets are almost entirely new public intelligence

Finding 4: 63.9% of sanctioned entities are on exactly one list

Finding 5: PEP screening is a much weaker leading indicator than vendors suggest

Finding 6: Russia and the West sanction disjoint populations

Honest limitations

Takeaways

Reproduce it

infermap Now Runs in TypeScript: Schema Mapping on the Edge

What infermap Does in 30 Seconds

The Seven Scorers

What the TypeScript Port Looks Like

Database support

Custom scorers

Config persistence

Same Algorithm, Same Accuracy

Zero Runtime Dependencies in Core

Doors This Opens

Upload-time schema resolution

Full-stack type safety

Monorepo workflows

Browser-based mapping UIs

Shared config between Python and TypeScript

What Stayed the Same

Install and Try It

benzsevern / infermap

Inference-driven schema mapping engine for Python and TypeScript. 7 built-in scorers, domain dictionaries (healthcare/finance/ecommerce), confidence calibration, cross-language accuracy benchmark (F1 0.84), and full Python↔TypeScript parity.

infermap

Table of contents

Install

Python

Key Takeaways

Reconciling 15 OSS Vulnerability Databases: What They Actually Cover

The fifteen sources

The schema and the join

Finding 1: GitHub reviews 9.1% of what it ingests

Finding 2: The JavaScript ecosystem has more tracked vulnerabilities than everything else combined

Finding 3: The free OSS stack is structurally blind to system-level vulnerabilities

What else is in the data

Honest limitations

Takeaways

Reproduce it

Wallet Attribution at Scale: ER on 13M Blockchain Records

The ten sources

The schema

Running GoldenMatch at 535k

Scaling to 13 million

The honest caveat

What the 13M run surfaced

1. Nine wallets cross-sanctioned by two governments

2. The largest clusters are universal infrastructure

3. Attackers verify source code at the long-tail baseline rate

What else is in the data

Honest limitations

Takeaways

Reproduce it

Hot take that the benchmark backs up: traditional OSS entity resolution trusts you, the user, to know what you're doing. On 50,000 rows of real healthcare data on a laptop, that trust is misplaced. Full writeup, real numbers, honest disclaimers

The OSS ER Bargain: What Entity Resolution Actually Costs You

The OSS ER Bargain: What Entity Resolution Actually Costs You

The OSS ER Bargain: What Entity Resolution Actually Costs You

The OSS bargain

The holistic alternative

What happened at 50,000 rows

Scaling out: does GoldenMatch survive 500,000?

What the sensitivity analysis actually shows

What "holistic" actually means

The disclaimers

Closing

benzsevern / goldenmatch

Entity resolution and deduplication toolkit — outperforms Splink, dedupe, and RecordLinkage on cross-domain benchmarks. Zero-config. MST cluster auto-splitting. Quality-weighted survivorship. 30 MCP tools on Smithery. 10 A2A skills. 97.2% F1 on DBLP-ACM.

GoldenMatch

Why GoldenMatch?

Choose your path

Matching

MCPs enabling data cleaning and deduping.

Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit

Why `drop_duplicates()` Fails on Real Data