benzsevern

Posted on • Originally published at bensevern.dev

Wallet Attribution at Scale: ER on 13M Blockchain Records

Every public blockchain attribution dataset is a partial, opinionated view of the same underlying reality. OFAC publishes ~800 sanctioned crypto wallets. Etherscan crowdsources ~50,000 tags across seven EVM chains. Sourcify holds ~14 million verified contract deployments. Forta tracks known-malicious contracts. DeFiLlama catalogs protocol addresses. Israel's Ministry of Defense and the FBI's Lazarus unit each publish their own targeted lists. None of them agree, none of them are complete, and almost none of them talk to each other.

I wanted to know what happens if you reconcile all of them. Not with a custom schema, not with hand-written joins — with a single entity resolution pipeline pointed at every public source I could find. The answer is 13,147,920 input rows, 30,958 multi-source clusters, and three findings I could not have produced at smaller scale.

The ten sources

I pulled every freely redistributable blockchain attribution dataset I could verify:

| # | Source | Rows | What it covers |
|---:|---|---:|---|
| 1 | OFAC SDN Enhanced XML | 788 | US Treasury sanctioned wallets, 18 chains |
| 2 | brianleect/etherscan-labels | 52,773 | Crowdsourced Etherscan tags, 7 EVM chains |
| 3 | dawsbot/eth-labels | 17,495 | Curated Ethereum mainnet categories |
| 4 | Sourcify parquet exports | 13,062,088 | Verified contract deployments, all chains |
| 5 | Forta labelled-datasets | 7,480 | Known malicious contracts + phishing |
| 6 | DeFiLlama protocols | 3,332 | Protocol contract addresses |
| 7 | ScamSniffer blacklist | 2,530 | Reported scam addresses |
| 8 | ethereum-lists | 717 | Dark/light address lists |
| 9 | OpenSanctions: il_mod_crypto | 684 | Israel MoD sanctioned wallets |
| 10 | OpenSanctions: us_fbi_lazarus_crypto | 33 | FBI Lazarus Group wallets |
| | **Total** | **13,147,920** | |

Sourcify dominates by two orders of magnitude. Everything else is a long tail of curated, opinionated, high-signal labels. That asymmetry shapes the whole story: Sourcify tells you what addresses exist, the other nine tell you what they mean, and entity resolution is what turns one into the other.

The schema

Every source projects to five columns:

```python
COMMON_COLS = ["address_norm", "address_raw", "entity_name", "label", "source"]
```
  • address_norm is the joining key: lowercase, 0x prefix stripped.
  • address_raw keeps the original format for display.
  • entity_name is whatever the source calls it ("LAZARUS GROUP", "Safe: Proxy Factory 1.3.0", "").
  • label is the source-specific tag, namespaced by dataset (e.g. etherscan:ofac-sanctions-lists).
  • source is the dataset identifier.

No fuzzy matching on names. Names disagree too often to be a primary signal — that's actually the central finding. The only reliable join is address_norm.
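As a minimal sketch of that join key (illustrative only, not lifted from the repo's staging scripts), normalization per the post's stated rule looks like:

```python
def normalize_address(addr: str) -> str:
    """Build address_norm per the post's rule: lowercase,
    0x prefix stripped. Non-EVM addresses pass through
    under the same rule."""
    a = addr.strip().lower()
    return a[2:] if a.startswith("0x") else a

# Checksummed and lowercase spellings of the same wallet
# collapse to one key, so they join:
key = normalize_address("0x098B716B8Aaf21512996dC57EB0615e2383E2f96")
```

Any two sources that spell the same address with different casing or prefixing now land in the same cluster for free.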

Running GoldenMatch at 535k

I started at a sensible scale: five sources plus Sourcify's Ethereum mainnet subset, 535,336 rows, and a direct call to goldenmatch.dedupe:


```python
import goldenmatch as gm  # the pip-installable GoldenMatch package

result = gm.dedupe(
    str(STAGED / "01_ofac.csv"),
    str(STAGED / "02_etherscan_labels.csv"),
    str(STAGED / "03_eth_labels.csv"),
    str(STAGED / "04_ethereum_lists.csv"),
    str(STAGED / "05_defillama.csv"),
    str(STAGED / "06_sourcify.csv"),
    exact=["address_norm"],
    blocking=["address_norm"],
)
```

This finished in about 40 seconds on a Windows laptop, found 12,640 multi-member clusters, and auto-fixed 51 data quality issues in the raw public sources (smart quotes, invisible characters, stray whitespace) before matching. GoldenCheck's quality scanner is bundled into the dedupe call — you don't ask for it, it just happens.

The results at 535k surfaced the best single anecdote in the whole exercise:

| Address | OFAC name | Etherscan name |
|---|---|---|
| 0x098B716B8Aaf21512996dC57EB0615e2383E2f96 | LAZARUS GROUP | Ronin Bridge Exploiter |
| 0x5f48c2a71b2cc96e3f0ccae4e39318ff0dc375b2 | SEMENOV, Roman | Tornado.Cash: Team 1 Vesting |

The first row is the Axie Infinity Ronin Bridge wallet — the address behind the $625 million Lazarus Group hack, labeled by OFAC as "LAZARUS GROUP" and by Etherscan as "Ronin Bridge Exploiter." Two correct names, completely unrelated strings. A name-based join finds nothing; an address-normalized join finds the link instantly. The second row ties a sanctioned Tornado Cash co-founder to a specific named vesting contract. If you take only one thing from this post, take this: names disagree, addresses don't.

Scaling to 13 million

The 535k run validated the pipeline. I wanted to know what happened at the real ceiling of free public data. That meant pulling all 14 Sourcify deployment parquets (one per million contracts, ~2 GB total) covering every chain Sourcify tracks — not just Ethereum mainnet.

```python
# fetch_all_sourcify.py — parallel download of 14 parquets
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as ex:
    for fut in as_completed([ex.submit(fetch_parquet, f) for f in files]):
        fut.result()  # re-raise any download exception immediately
```

The staging step then projects them into the common schema via Polars, one parquet at a time, so the full 13M-row frame is never materialized at once:

```python
import polars as pl

for pq in parquets:
    # Read only the two needed columns to keep each ~1M-row file light.
    df = pl.read_parquet(pq, columns=["chain_id", "address"])
    out = df.select([
        pl.col("address").bin.encode("hex").alias("address_norm"),
        ("0x" + pl.col("address").bin.encode("hex")).alias("address_raw"),
        pl.lit("").alias("entity_name"),
        ("sourcify:chain_" + pl.col("chain_id").cast(pl.Utf8)).alias("label"),
        pl.lit("sourcify").alias("source"),
    ])
    out.write_csv(f, include_header=False)  # f: this parquet's staged CSV path
```

After staging, the full dataset is 13,147,920 rows across 10 sources.

The honest caveat

At 13M rows, calling goldenmatch.dedupe crashes at the cluster-materialization step with a MemoryError in the Python dict build-out. That's not a GoldenMatch bug — it's pure Python object overhead on 12 million unique cluster keys. Since the full pipeline was already reducing to exact-match-on-address_norm blocking (names disagree too much to fuzzy on), the operation is mathematically equivalent to a polars groupby. I wrote that directly:

```python
import polars as pl

all_df = pl.concat([pl.scan_csv(p) for p in STAGED.glob("*.csv")]).collect()

clusters = (
    all_df.group_by("address_norm")
    .agg([
        pl.col("source").unique().alias("sources"),
        pl.col("source").n_unique().alias("n_sources"),
        pl.col("source").len().alias("size"),
        pl.col("entity_name").filter(pl.col("entity_name") != "").alias("entities"),
        pl.col("label"),
    ])
)
```

Same logical result, fits in memory, runs in about 30 seconds. The 535k run proves the ER pipeline works end-to-end with GoldenMatch's full feature set (fuzzy scorers, blocking strategies, lineage, golden records). The 13M run uses GoldenMatch's auto-config decisions as the template but delegates the exact-match groupby to Polars because Python dicts are the wrong data structure at that volume. I want to be upfront about that — the scale-up is not an endorsement of "GoldenMatch scales to 13M natively," it's an endorsement of "GoldenMatch chose the right blocking strategy at 535k, and that strategy is trivially reproducible at 13M in a columnar engine."

What the 13M run surfaced

1. Nine wallets cross-sanctioned by two governments

This is the headline finding and only possible because I had two independent sanctions sources. Nine crypto wallets appear on both the US Treasury OFAC list and Israel's MoD sanctioned crypto list:

| Wallet | Entity | Chain |
|---|---|---|
| TCzq6m2zxnQkrZrf8cqYcK6bbXQYAfWYKC | ZEDCEX EXCHANGE LTD | Tron |
| TGsNFrgWfbGN2gX25Wcf8oTejtxtQkvmEx | ZEDCEX EXCHANGE LTD | Tron |
| TTS9o5KkpGgH8cK9LofLmMAPYb5zfQvSNa | ZEDCEX EXCHANGE LTD | Tron |
| TNuA5CQ6LB4jTHoNrjEeQZJmcmhQuHMbQ7 | ZEDCEX EXCHANGE LTD | Tron |
| TLvuvpfBKdxddxSsJefeiGCe9eVY8HUroE | ZEDCEX EXCHANGE LTD | Tron |
| TWBAPzpPiZarfVsY2BLXeaLhNHurn4wkWG | AL-LAW, Tawfiq Muhammad Sa'id | Tron |
| 0x175d44451403Edf28469dF03A9280c1197ADb92c | GAZA NOW | Ethereum |
| 0x21B8d56BDA776bbE68655A16895afd96F5534feD | GAZA NOW | Ethereum |
| 19D1iGzDr7FyAdiy3ZZdxMd6ttHj1kj6WW | BUY CASH MONEY AND MONEY TRANSFER CO | Bitcoin |

The ZEDCEX cluster is the standout: five wallets on a single Tron-based exchange independently sanctioned by both the United States and Israel. GAZA NOW contributes two cross-confirmed Ethereum wallets. These are the highest-confidence sanctioned wallets in the entire dataset — not because any individual list is more authoritative, but because two independent government entity resolution processes landed on the same on-chain identities.

You cannot find this with OFAC alone. You cannot find it with Israel's list alone. You find it only when you reconcile them.

2. The largest clusters are universal infrastructure

At multi-chain scale, the top multi-source clusters by member count are all deterministic-deployment contracts:

| Size | Address | What it is |
|---:|---|---|
| 45 | 0x7cbb62eaa69f79e6873cd1ecb2392971036cfaa4 | Safe: Create Call 1.3.0 |
| 43 | 0x40a2accbd92bca938b02010e17a5b8929b49130d | Safe: Multi Send Call Only 1.3.0 |
| 42 | 0xa6b71e26c5e0845f74c812102ca7114b6a896ab2 | Safe: Proxy Factory 1.3.0 |
| 40 | 0x3e5c63644e683549055b9be8653de26e0b4cd36e | Safe: Singleton L2 1.3.0 |
| 39 | 0xd9db270c1b5e3bd161e8c8503c55ceabee709552 | Safe: Singleton 1.3.0 |
| 37 | 0xf48f2b2d2a534e402487b3ee7c18c33aec0fe5e4 | Safe: Compatibility Fallback Handler 1.3.0 |
| 30 | 0x000000000022d473030f116ddee9f6b43ac78ba3 | Uniswap Permit2 |
| 27 | 0x66a71dcef29a0ffbdbe3c6a460a3b5bc225cd675 | LayerZero Ethereum Endpoint |

Safe (formerly Gnosis Safe) uses CREATE2 with chain-independent salts, which means the same singleton contract ends up at the same address on every EVM chain it's deployed to. So do Permit2 and LayerZero. The cluster size is literally a count of "how many chains is this deployed on?" — 45 chains for Create Call 1.3.0, 30 for Permit2.

That's a real finding about the structure of the modern EVM ecosystem. Entity resolution on multi-chain contract deployment data automatically surfaces the universal-infrastructure layer without anyone asking it to. If you're building an allowlist of "standard reusable contracts that are safe-by-reputation across every chain," this cluster table is a reasonable starting point. I did not go in looking for this — it just fell out of the data.

3. Attackers verify source code at the long-tail baseline rate

I also pulled Forta's labelled-datasets, which include 719 known-malicious Ethereum smart contracts and 569 phishing-scam contracts. The honest question: do attackers publish verified source code on Sourcify?

| Population | Size | Verified on Sourcify |
|---|---:|---|
| Forta malicious contracts | 719 | 3 (0.4%) |
| Forta phishing contracts | 569 | 3 (0.5%) |
| ScamSniffer addresses | 2,530 | 0 (0%) |

This is where I have to resist the obvious headline. "Malicious contracts almost never verify source code" is technically true but misleading: Sourcify holds ~324k verified Ethereum mainnet contracts against an estimated ~70M+ total contracts ever deployed, which puts the baseline verification rate around 0.5%. Malicious contracts at 0.4% are statistically indistinguishable from that baseline.

The defensible framing is this: malicious contracts behave like the long tail of random/abandoned/spam contracts, not like production contracts. Mainstream DeFi protocols verify at rates well above 50%. Attackers don't. For the purposes of attribution, "is this a Sourcify-verified contract?" on its own is a weak filter — but "verified Sourcify contract AND has Etherscan tags AND appears in eth-labels AND DeFiLlama" is an extremely strong legitimacy signal. The 301 quadruple-confirmed clusters at 13M scale are the set of contracts that every independent attribution observer agrees exist and matter.

The 3 verified malicious contracts are outliers worth manual investigation: two are Fake_Phishing tagged contracts that nonetheless published source (presumably to look legitimate to a casual reviewer), and one is a suspicious "TrueEUR" token.
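A quick back-of-envelope check (stdlib only, using the post's own counts) supports the "statistically indistinguishable" claim:

```python
import math

# Numbers from the post: 3 of 719 Forta-malicious contracts are
# Sourcify-verified; the long-tail baseline is ~324k verified out
# of an estimated ~70M contracts ever deployed.
n, observed = 719, 3
baseline = 324_000 / 70_000_000

expected = n * baseline                        # verifications expected at baseline
sd = math.sqrt(n * baseline * (1 - baseline))  # binomial standard deviation
z = (observed - expected) / sd
```

Expected is about 3.3 against an observed 3, so |z| is well under 1 — nowhere near a significant departure from the baseline rate.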

What else is in the data

A few secondary findings that didn't need their own section:

Deployer-reuse patterns in the malicious-contracts dataset. The Forta dataset records contract_creator for each malicious contract. Grouping by creator surfaces a heavily skewed distribution: one address deployed 15 Fake_Phishing contracts, another deployed 11, and the original bZx Exploiter 1 wallet deployed 9 distinct exploit contracts — all tied back to the 2020 bZx flash-loan attack. Twelve deployer addresses are responsible for roughly 15% of the entire labeled malicious-contract corpus. Watching deployers is dramatically more efficient than watching deployments, and the clustering falls out of ER trivially.

OFAC's internal duplicates. The SUEX OTC wallets appear in OFAC's own list twice — once under the XBT:CYBER2 program and once under USDT:CYBER2, because Treasury sanctioned the same Bitcoin address for Bitcoin activity and the Tron-USDT it bridged through. Without ER you'd treat them as two distinct records; with ER the internal duplication is obvious.

Cross-source pattern distribution. At 13M the dominant multi-source pattern is etherscan-labels + sourcify (12,472 clusters) — verified contracts that are also tagged. Then eth-labels + forta (5,146) — curated DeFi labels overlapping with malicious flags. Then the triple-confirmed eth-labels + etherscan-labels + sourcify (5,034). The full distribution is in output_15m/report.json in the companion repo.

Honest limitations

I want to be precise about what this analysis is and isn't:

  • It is not criminal wallet discovery. Every "sanctioned" label comes from a government source. ER reconciles those labels across sources. It does not identify new bad actors. Nothing in this post claims to.
  • It is not a substitute for on-chain forensics. Chainalysis-style graph tracing answers a completely different question (follow the flows). This pipeline answers "whose opinions do we have about this address, and do they agree?"
  • Ground truth is bounded. When two sanctions lists agree, you have two-jurisdiction confirmation, which is strong. When Forta and eth-labels agree on a malicious tag, you have two independent community labels, which is weaker. Nothing here is a court case.
  • The Sourcify baseline assumes the universe of all Ethereum contracts. If you normalize against "contracts anyone cares about" instead of "all contracts ever deployed," the verification-rate story changes. I chose the inclusive denominator on purpose — it's what Sourcify's own data supports.
  • Names are dirty. I found five different OFAC entries for the same Bitcoin address with slightly different entity spellings; two different Etherscan tags for the same Lazarus wallet; and an Israeli-sourced wallet whose "entity_name" field was just the address itself. ER is only as clean as the input. GoldenCheck auto-fixed 51 text-level issues before matching, but it didn't — and shouldn't — normalize semantic disagreement.

Takeaways

  • Ten public blockchain attribution datasets, 13.1M records, 30,958 multi-source clusters. The free public attribution universe is larger than it looks if you combine it, and trivially reconcilable if you normalize the address.
  • Names disagree, addresses don't. The Lazarus Group / Ronin Bridge Exploiter case is the clearest single argument for entity resolution on blockchain data I've seen.
  • Cross-jurisdictional sanctioning is real and detectable. Nine wallets — including a cluster of five ZEDCEX addresses — are sanctioned by both the US and Israel. You only see this if you reconcile multiple sanctions sources.
  • ER on multi-chain contract data surfaces universal infrastructure for free. The top clusters are Safe, Permit2, and LayerZero — deployed at the same CREATE2 address across 30-45 chains each. Cluster size is the chain count.
  • Attackers verify source code at the long-tail baseline rate, not the production rate. The useful signal is not "verified" but "verified and independently tagged by multiple labelers."

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-wallet-attribution. The full flow is four commands:

```shell
python fetch_public_data.py    # ~2.5 GB download, ~10 min
python extract_ofac.py         # parse SDN_ENHANCED.xml
python run_15m.py              # stage 10 sources to common schema
python analyze_15m.py          # cross-source cluster analysis
```

All ten data sources are permissively licensed and redistributable. No API keys. No auth. The ~13M-row analysis finishes in about 3 minutes of wall-clock time on a laptop once the data is local.

If you want to see what GoldenMatch looks like with its full feature set — fuzzy scoring, blocking strategies, lineage tracking, golden records — the earlier archive/run_clusters.py runs it on the 535k-row subset end-to-end. That's the run that surfaced the Lazarus / Ronin Bridge case. Both scripts are preserved because they're answering different questions: does the ER pipeline work? (yes, 535k with GoldenMatch) and what falls out at the public-data ceiling? (the 13M run above).

Install GoldenMatch: `pip install goldenmatch`. Star the repo: benzsevern/goldenmatch. Try the live playground: bensevern.dev/playground.


Reproducibility footer.

  • Source datasets: OFAC SDN Enhanced XML (US Treasury), Sourcify parquet exports (export.sourcify.dev, manifest timestamp 2026-01-05T16:48:52Z), brianleect/etherscan-labels (main), dawsbot/eth-labels (master), MyEtherWallet/ethereum-lists (master), forta-network/labelled-datasets (main), scamsniffer/scam-database (main), api.llama.fi/protocols, OpenSanctions us_fbi_lazarus_crypto and il_mod_crypto (latest).
  • Total download: ~2.5 GB.
  • Input rows: 13,147,920 across 10 sources.
  • Unique addresses: 12,588,179.
  • Multi-source clusters: 30,958. Quadruple-confirmed: 301. Cross-sanctioned (US + Israel): 9.
  • Tools: goldenmatch 1.4.4, polars 1.39, Python 3.12.
  • Hardware: Windows laptop, 32 GB RAM.
  • Code and raw outputs: benzsevern/goldenmatch-wallet-attribution (MIT). Scripts: fetch_public_data.py, extract_ofac.py, run_15m.py, analyze_15m.py, analyze_malicious.py. Headline JSON: output_15m/report.json. Cross-sanctioned records: output_15m/cross_sanctioned.json.
  • Data date: 2026-04-09.

