benzsevern

Posted on • Originally published at bensevern.dev

Wallet Attribution at Scale: ER on 13M Blockchain Records

Every public blockchain attribution dataset is a partial, opinionated view of the same underlying reality. OFAC publishes ~800 sanctioned crypto wallets. Etherscan crowdsources ~50,000 tags across seven EVM chains. Sourcify holds ~14 million verified contract deployments. Forta tracks known-malicious contracts. DeFiLlama catalogs protocol addresses. Israel's Ministry of Defense and the FBI's Lazarus unit each publish their own targeted lists. None of them agree, none of them are complete, and almost none of them talk to each other.

I wanted to know what happens if you reconcile all of them. Not with a custom schema, not with hand-written joins — with a single entity resolution pipeline pointed at every public source I could find. The answer is 13,147,920 input rows, 30,958 multi-source clusters, and three findings I could not have produced at smaller scale.

The ten sources

I pulled every freely redistributable blockchain attribution dataset I could verify:

| # | Source | Rows | What it covers |
|---:|---|---:|---|
| 1 | OFAC SDN Enhanced XML | 788 | US Treasury sanctioned wallets, 18 chains |
| 2 | brianleect/etherscan-labels | 52,773 | Crowdsourced Etherscan tags, 7 EVM chains |
| 3 | dawsbot/eth-labels | 17,495 | Curated Ethereum mainnet categories |
| 4 | Sourcify parquet exports | 13,062,088 | Verified contract deployments, all chains |
| 5 | Forta labelled-datasets | 7,480 | Known malicious contracts + phishing |
| 6 | DeFiLlama protocols | 3,332 | Protocol contract addresses |
| 7 | ScamSniffer blacklist | 2,530 | Reported scam addresses |
| 8 | ethereum-lists | 717 | Dark/light address lists |
| 9 | OpenSanctions: il_mod_crypto | 684 | Israel MoD sanctioned wallets |
| 10 | OpenSanctions: us_fbi_lazarus_crypto | 33 | FBI Lazarus Group wallets |
| | **Total** | **13,147,920** | |

Sourcify dominates by two orders of magnitude. Everything else is a long tail of curated, opinionated, high-signal labels. That asymmetry shapes the whole story: Sourcify tells you what addresses exist, the other nine tell you what they mean, and entity resolution is what turns one into the other.

The schema

Every source projects to five columns:

```python
COMMON_COLS = ["address_norm", "address_raw", "entity_name", "label", "source"]
```
  • address_norm is the joining key: lowercase, 0x prefix stripped.
  • address_raw keeps the original format for display.
  • entity_name is whatever the source calls it ("LAZARUS GROUP", "Safe: Proxy Factory 1.3.0", "").
  • label is the source-specific tag, namespaced by dataset (e.g. etherscan:ofac-sanctions-lists).
  • source is the dataset identifier.

No fuzzy matching on names. Names disagree too often to be a primary signal — that's actually the central finding. The only reliable join is address_norm.
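As a minimal sketch of that join key (illustrative only, not lifted from the repo's staging scripts), normalization per the post's stated rule looks like:

```python
def normalize_address(addr: str) -> str:
    """Build address_norm per the post's rule: lowercase,
    0x prefix stripped. Non-EVM addresses pass through
    under the same rule."""
    a = addr.strip().lower()
    return a[2:] if a.startswith("0x") else a

# Checksummed and lowercase spellings of the same wallet
# collapse to one key, so they join:
key = normalize_address("0x098B716B8Aaf21512996dC57EB0615e2383E2f96")
```

Any two sources that spell the same address with different casing or prefixing now land in the same cluster for free.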

Running GoldenMatch at 535k

I started at a sensible scale: five sources plus Sourcify's Ethereum mainnet subset, 535,336 rows, and a direct call to goldenmatch.dedupe:


```python
import goldenmatch as gm  # the pip-installable GoldenMatch package

result = gm.dedupe(
    str(STAGED / "01_ofac.csv"),
    str(STAGED / "02_etherscan_labels.csv"),
    str(STAGED / "03_eth_labels.csv"),
    str(STAGED / "04_ethereum_lists.csv"),
    str(STAGED / "05_defillama.csv"),
    str(STAGED / "06_sourcify.csv"),
    exact=["address_norm"],
    blocking=["address_norm"],
)
```

This finished in about 40 seconds on a Windows laptop, found 12,640 multi-member clusters, and auto-fixed 51 data quality issues in the raw public sources (smart quotes, invisible characters, stray whitespace) before matching. GoldenCheck's quality scanner is bundled into the dedupe call — you don't ask for it, it just happens.

The results at 535k surfaced the best single anecdote in the whole exercise:

| Address | OFAC name | Etherscan name |
|---|---|---|
| 0x098B716B8Aaf21512996dC57EB0615e2383E2f96 | LAZARUS GROUP | Ronin Bridge Exploiter |
| 0x5f48c2a71b2cc96e3f0ccae4e39318ff0dc375b2 | SEMENOV, Roman | Tornado.Cash: Team 1 Vesting |

The first row is the Axie Infinity Ronin Bridge wallet — the address behind the $625 million Lazarus Group hack, labeled by OFAC as "LAZARUS GROUP" and by Etherscan as "Ronin Bridge Exploiter." Two correct names, completely unrelated strings. A name-based join finds nothing; an address-normalized join finds the link instantly. The second row ties a sanctioned Tornado Cash co-founder to a specific named vesting contract. If you take only one thing from this post, take this: names disagree, addresses don't.

Scaling to 13 million

The 535k run validated the pipeline. I wanted to know what happened at the real ceiling of free public data. That meant pulling all 14 Sourcify deployment parquets (one per million contracts, ~2 GB total) covering every chain Sourcify tracks — not just Ethereum mainnet.

```python
# fetch_all_sourcify.py — parallel download of 14 parquets
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=4) as ex:
    for fut in as_completed([ex.submit(fetch_parquet, f) for f in files]):
        fut.result()  # re-raise any download exception immediately
```

The staging step then projects them into the common schema via Polars, one parquet at a time, so the full 13M-row frame is never materialized at once:

```python
import polars as pl

for pq in parquets:
    # Read only the two needed columns to keep each ~1M-row file light.
    df = pl.read_parquet(pq, columns=["chain_id", "address"])
    out = df.select([
        pl.col("address").bin.encode("hex").alias("address_norm"),
        ("0x" + pl.col("address").bin.encode("hex")).alias("address_raw"),
        pl.lit("").alias("entity_name"),
        ("sourcify:chain_" + pl.col("chain_id").cast(pl.Utf8)).alias("label"),
        pl.lit("sourcify").alias("source"),
    ])
    out.write_csv(f, include_header=False)  # f: this parquet's staged CSV path
```

After staging, the full dataset is 13,147,920 rows across 10 sources.

The honest caveat

At 13M rows, calling goldenmatch.dedupe crashes at the cluster-materialization step with a MemoryError in the Python dict build-out. That's not a GoldenMatch bug — it's pure Python object overhead on 12 million unique cluster keys. Since the full pipeline was already reducing to exact-match-on-address_norm blocking (names disagree too much to fuzzy on), the operation is mathematically equivalent to a polars groupby. I wrote that directly:

```python
import polars as pl

all_df = pl.concat([pl.scan_csv(p) for p in STAGED.glob("*.csv")]).collect()

clusters = (
    all_df.group_by("address_norm")
    .agg([
        pl.col("source").unique().alias("sources"),
        pl.col("source").n_unique().alias("n_sources"),
        pl.col("source").len().alias("size"),
        pl.col("entity_name").filter(pl.col("entity_name") != "").alias("entities"),
        pl.col("label"),
    ])
)
```

Same logical result, fits in memory, runs in about 30 seconds. The 535k run proves the ER pipeline works end-to-end with GoldenMatch's full feature set (fuzzy scorers, blocking strategies, lineage, golden records). The 13M run uses GoldenMatch's auto-config decisions as the template but delegates the exact-match groupby to Polars because Python dicts are the wrong data structure at that volume. I want to be upfront about that — the scale-up is not an endorsement of "GoldenMatch scales to 13M natively," it's an endorsement of "GoldenMatch chose the right blocking strategy at 535k, and that strategy is trivially reproducible at 13M in a columnar engine."

What the 13M run surfaced

1. Nine wallets cross-sanctioned by two governments

This is the headline finding and only possible because I had two independent sanctions sources. Nine crypto wallets appear on both the US Treasury OFAC list and Israel's MoD sanctioned crypto list:

| Wallet | Entity | Chain |
|---|---|---|
| TCzq6m2zxnQkrZrf8cqYcK6bbXQYAfWYKC | ZEDCEX EXCHANGE LTD | Tron |
| TGsNFrgWfbGN2gX25Wcf8oTejtxtQkvmEx | ZEDCEX EXCHANGE LTD | Tron |
| TTS9o5KkpGgH8cK9LofLmMAPYb5zfQvSNa | ZEDCEX EXCHANGE LTD | Tron |
| TNuA5CQ6LB4jTHoNrjEeQZJmcmhQuHMbQ7 | ZEDCEX EXCHANGE LTD | Tron |
| TLvuvpfBKdxddxSsJefeiGCe9eVY8HUroE | ZEDCEX EXCHANGE LTD | Tron |
| TWBAPzpPiZarfVsY2BLXeaLhNHurn4wkWG | AL-LAW, Tawfiq Muhammad Sa'id | Tron |
| 0x175d44451403Edf28469dF03A9280c1197ADb92c | GAZA NOW | Ethereum |
| 0x21B8d56BDA776bbE68655A16895afd96F5534feD | GAZA NOW | Ethereum |
| 19D1iGzDr7FyAdiy3ZZdxMd6ttHj1kj6WW | BUY CASH MONEY AND MONEY TRANSFER CO | Bitcoin |

The ZEDCEX cluster is the standout: five wallets on a single Tron-based exchange independently sanctioned by both the United States and Israel. GAZA NOW contributes two cross-confirmed Ethereum wallets. These are the highest-confidence sanctioned wallets in the entire dataset — not because any individual list is more authoritative, but because two independent government entity resolution processes landed on the same on-chain identities.

You cannot find this with OFAC alone. You cannot find it with Israel's list alone. You find it only when you reconcile them.

2. The largest clusters are universal infrastructure

At multi-chain scale, the top multi-source clusters by member count are all deterministic-deployment contracts:

| Size | Address | What it is |
|---:|---|---|
| 45 | 0x7cbb62eaa69f79e6873cd1ecb2392971036cfaa4 | Safe: Create Call 1.3.0 |
| 43 | 0x40a2accbd92bca938b02010e17a5b8929b49130d | Safe: Multi Send Call Only 1.3.0 |
| 42 | 0xa6b71e26c5e0845f74c812102ca7114b6a896ab2 | Safe: Proxy Factory 1.3.0 |
| 40 | 0x3e5c63644e683549055b9be8653de26e0b4cd36e | Safe: Singleton L2 1.3.0 |
| 39 | 0xd9db270c1b5e3bd161e8c8503c55ceabee709552 | Safe: Singleton 1.3.0 |
| 37 | 0xf48f2b2d2a534e402487b3ee7c18c33aec0fe5e4 | Safe: Compatibility Fallback Handler 1.3.0 |
| 30 | 0x000000000022d473030f116ddee9f6b43ac78ba3 | Uniswap Permit2 |
| 27 | 0x66a71dcef29a0ffbdbe3c6a460a3b5bc225cd675 | LayerZero Ethereum Endpoint |

Safe (formerly Gnosis Safe) uses CREATE2 with chain-independent salts, which means the same singleton contract ends up at the same address on every EVM chain it's deployed to. So do Permit2 and LayerZero. The cluster size is literally a count of "how many chains is this deployed on?" — 45 chains for Create Call 1.3.0, 30 for Permit2.

That's a real finding about the structure of the modern EVM ecosystem. Entity resolution on multi-chain contract deployment data automatically surfaces the universal-infrastructure layer without anyone asking it to. If you're building an allowlist of "standard reusable contracts that are safe-by-reputation across every chain," this cluster table is a reasonable starting point. I did not go in looking for this — it just fell out of the data.

3. Attackers verify source code at the long-tail baseline rate

I also pulled Forta's labelled-datasets, which include 719 known-malicious Ethereum smart contracts and 569 phishing-scam contracts. The honest question: do attackers publish verified source code on Sourcify?

| Population | Size | Verified on Sourcify |
|---|---:|---|
| Forta malicious contracts | 719 | 3 (0.4%) |
| Forta phishing contracts | 569 | 3 (0.5%) |
| ScamSniffer addresses | 2,530 | 0 (0%) |

This is where I have to resist the obvious headline. "Malicious contracts almost never verify source code" is technically true but misleading: Sourcify holds ~324k verified Ethereum mainnet contracts against an estimated ~70M+ total contracts ever deployed, which puts the baseline verification rate around 0.5%. Malicious contracts at 0.4% are statistically indistinguishable from that baseline.

The defensible framing is this: malicious contracts behave like the long tail of random/abandoned/spam contracts, not like production contracts. Mainstream DeFi protocols verify at rates well above 50%. Attackers don't. For the purposes of attribution, "is this a Sourcify-verified contract?" on its own is a weak filter — but "verified Sourcify contract AND has Etherscan tags AND appears in eth-labels AND DeFiLlama" is an extremely strong legitimacy signal. The 301 quadruple-confirmed clusters at 13M scale are the set of contracts that every independent attribution observer agrees exist and matter.

The 3 verified malicious contracts are outliers worth manual investigation: two are Fake_Phishing tagged contracts that nonetheless published source (presumably to look legitimate to a casual reviewer), and one is a suspicious "TrueEUR" token.
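A quick back-of-envelope check (stdlib only, using the post's own counts) supports the "statistically indistinguishable" claim:

```python
import math

# Numbers from the post: 3 of 719 Forta-malicious contracts are
# Sourcify-verified; the long-tail baseline is ~324k verified out
# of an estimated ~70M contracts ever deployed.
n, observed = 719, 3
baseline = 324_000 / 70_000_000

expected = n * baseline                        # verifications expected at baseline
sd = math.sqrt(n * baseline * (1 - baseline))  # binomial standard deviation
z = (observed - expected) / sd
```

Expected is about 3.3 against an observed 3, so |z| is well under 1 — nowhere near a significant departure from the baseline rate.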

What else is in the data

A few secondary findings that didn't need their own section:

Deployer-reuse patterns in the malicious-contracts dataset. The Forta dataset records contract_creator for each malicious contract. Grouping by creator surfaces a heavily skewed distribution: one address deployed 15 Fake_Phishing contracts, another deployed 11, and the original bZx Exploiter 1 wallet deployed 9 distinct exploit contracts — all tied back to the 2020 bZx flash-loan attack. Twelve deployer addresses are responsible for roughly 15% of the entire labeled malicious-contract corpus. Watching deployers is dramatically more efficient than watching deployments, and the clustering falls out of ER trivially.

OFAC's internal duplicates. The SUEX OTC wallets appear in OFAC's own list twice — once under the XBT:CYBER2 program and once under USDT:CYBER2, because Treasury sanctioned the same Bitcoin address for Bitcoin activity and the Tron-USDT it bridged through. Without ER you'd treat them as two distinct records; with ER the internal duplication is obvious.

Cross-source pattern distribution. At 13M the dominant multi-source pattern is etherscan-labels + sourcify (12,472 clusters) — verified contracts that are also tagged. Then eth-labels + forta (5,146) — curated DeFi labels overlapping with malicious flags. Then the triple-confirmed eth-labels + etherscan-labels + sourcify (5,034). The full distribution is in output_15m/report.json in the companion repo.

Honest limitations

I want to be precise about what this analysis is and isn't:

  • It is not criminal wallet discovery. Every "sanctioned" label comes from a government source. ER reconciles those labels across sources. It does not identify new bad actors. Nothing in this post claims to.
  • It is not a substitute for on-chain forensics. Chainalysis-style graph tracing answers a completely different question (follow the flows). This pipeline answers "whose opinions do we have about this address, and do they agree?"
  • Ground truth is bounded. When two sanctions lists agree, you have two-jurisdiction confirmation, which is strong. When Forta and eth-labels agree on a malicious tag, you have two independent community labels, which is weaker. Nothing here is a court case.
  • The Sourcify baseline assumes the universe of all Ethereum contracts. If you normalize against "contracts anyone cares about" instead of "all contracts ever deployed," the verification-rate story changes. I chose the inclusive denominator on purpose — it's what Sourcify's own data supports.
  • Names are dirty. I found five different OFAC entries for the same Bitcoin address with slightly different entity spellings; two different Etherscan tags for the same Lazarus wallet; and an Israeli-sourced wallet whose "entity_name" field was just the address itself. ER is only as clean as the input. GoldenCheck auto-fixed 51 text-level issues before matching, but it didn't — and shouldn't — normalize semantic disagreement.

Takeaways

  • Ten public blockchain attribution datasets, 13.1M records, 30,958 multi-source clusters. The free public attribution universe is larger than it looks if you combine it, and trivially reconcilable if you normalize the address.
  • Names disagree, addresses don't. The Lazarus Group / Ronin Bridge Exploiter case is the clearest single argument for entity resolution on blockchain data I've seen.
  • Cross-jurisdictional sanctioning is real and detectable. Nine wallets — including a cluster of five ZEDCEX addresses — are sanctioned by both the US and Israel. You only see this if you reconcile multiple sanctions sources.
  • ER on multi-chain contract data surfaces universal infrastructure for free. The top clusters are Safe, Permit2, and LayerZero — deployed at the same CREATE2 address across 30-45 chains each. Cluster size is the chain count.
  • Attackers verify source code at the long-tail baseline rate, not the production rate. The useful signal is not "verified" but "verified and independently tagged by multiple labelers."

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-wallet-attribution. The full flow is four commands:

```shell
python fetch_public_data.py    # ~2.5 GB download, ~10 min
python extract_ofac.py         # parse SDN_ENHANCED.xml
python run_15m.py              # stage 10 sources to common schema
python analyze_15m.py          # cross-source cluster analysis
```

All ten data sources are permissively licensed and redistributable. No API keys. No auth. The ~13M-row analysis finishes in about 3 minutes of wall-clock time on a laptop once the data is local.

If you want to see what GoldenMatch looks like with its full feature set — fuzzy scoring, blocking strategies, lineage tracking, golden records — the earlier archive/run_clusters.py runs it on the 535k-row subset end-to-end. That's the run that surfaced the Lazarus / Ronin Bridge case. Both scripts are preserved because they're answering different questions: does the ER pipeline work? (yes, 535k with GoldenMatch) and what falls out at the public-data ceiling? (the 13M run above).

Install GoldenMatch: `pip install goldenmatch`. Star the repo: benzsevern/goldenmatch. Try the live playground: bensevern.dev/playground.


Reproducibility footer.

  • Source datasets: OFAC SDN Enhanced XML (US Treasury), Sourcify parquet exports (export.sourcify.dev, manifest timestamp 2026-01-05T16:48:52Z), brianleect/etherscan-labels (main), dawsbot/eth-labels (master), MyEtherWallet/ethereum-lists (master), forta-network/labelled-datasets (main), scamsniffer/scam-database (main), api.llama.fi/protocols, OpenSanctions us_fbi_lazarus_crypto and il_mod_crypto (latest).
  • Total download: ~2.5 GB.
  • Input rows: 13,147,920 across 10 sources.
  • Unique addresses: 12,588,179.
  • Multi-source clusters: 30,958. Quadruple-confirmed: 301. Cross-sanctioned (US + Israel): 9.
  • Tools: goldenmatch 1.4.4, polars 1.39, Python 3.12.
  • Hardware: Windows laptop, 32 GB RAM.
  • Code and raw outputs: benzsevern/goldenmatch-wallet-attribution (MIT). Scripts: fetch_public_data.py, extract_ofac.py, run_15m.py, analyze_15m.py, analyze_malicious.py. Headline JSON: output_15m/report.json. Cross-sanctioned records: output_15m/cross_sanctioned.json.
  • Data date: 2026-04-09.

