benzsevern

Posted on Mar 24 • Edited on Apr 4

Two Hospitals Matched Patient Records Without Sharing a Single Name

#ai #python #opensource #datascience

Hospital A has 50,000 patient records. Hospital B has 40,000. Some patients visit both. Nobody knows which ones.

They need to find the overlap — for care coordination, billing reconciliation, research. But HIPAA says neither hospital can share raw patient data with the other. No names. No SSNs. No dates of birth. Nothing identifiable.

So how do you match records you're not allowed to see?

Bloom Filters

A bloom filter is a bit array. You take a name like "John Smith," lowercase it, break it into overlapping character pairs ("jo", "oh", "hn", "n ", " s", "sm", "mi", "it", "th"), hash each pair into positions in the bit array, and flip those bits to 1.

The result is an encoded fingerprint rather than ciphertext: it is built from one-way hashes, so there is nothing to decrypt and no direct way to read "John Smith" back out of the bits. But "Jon Smith" produces a bloom filter with most of the same bits flipped, because most of the character pairs overlap.

Two similar names → overlapping bits → measurable similarity. Without ever seeing the names.
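To make the encoding step concrete, here is a minimal Python sketch of it. It is not GoldenMatch's implementation: the function names, the SHA-256 double-hashing scheme, and the defaults (1024 bits, 30 hashes, borrowed from the high security level described later in this post) are assumptions for illustration. Production PPRL encoders typically use a keyed hash (HMAC) with a secret shared by both parties, which is left out here to keep the example short.

```python
import hashlib


def bigrams(text: str) -> list[str]:
    """Lowercase the text and return its overlapping character pairs."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bloom_encode(text: str, num_bits: int = 1024, num_hashes: int = 30) -> int:
    """Encode text as a bloom filter, stored as a Python int used as a bit array.

    Hypothetical scheme: two 64-bit values are taken from one SHA-256 digest
    and combined as h1 + i * h2 to derive num_hashes positions per bigram.
    """
    bits = 0
    for gram in set(bigrams(text)):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(num_hashes):
            bits |= 1 << ((h1 + i * h2) % num_bits)
    return bits


john = bloom_encode("John Smith")
jon = bloom_encode("Jon Smith")
shared = bin(john & jon).count("1")
print(f"bits set for John: {bin(john).count('1')}, shared with Jon: {shared}")
```

"John Smith" and "Jon Smith" have seven character pairs in common, so most of the bits set for one are also set for the other.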

The Command

pip install goldenmatch

goldenmatch pprl link \
  --file-a hospital_a.csv \
  --file-b hospital_b.csv \
  --fields first_name,last_name,dob,zip \
  --security high \
  --threshold 0.85

Here's what happens:

Hospital A computes bloom filters for each record locally. Only the encoded bit arrays leave its system.
Hospital B does the same.
The coordinator receives the two sets of bit arrays, computes the Dice similarity between every candidate pair, and returns cluster IDs.

The coordinator never sees a name, a date of birth, or a ZIP code. Just bits.
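The coordinator's comparison can be sketched in a few lines of Python. The Dice coefficient of two bit arrays is twice the number of bits set in both, divided by the total number of bits set in each. The function names and the dictionary layout below are assumptions for illustration, and this sketch compares every pair, whereas a real system would first narrow the comparison down to candidate pairs.

```python
def dice_similarity(a: int, b: int) -> float:
    """Dice coefficient of two bit arrays stored as Python ints."""
    ones_a = bin(a).count("1")
    ones_b = bin(b).count("1")
    if ones_a + ones_b == 0:
        return 0.0
    return 2 * bin(a & b).count("1") / (ones_a + ones_b)


def link(filters_a: dict, filters_b: dict, threshold: float = 0.85) -> list:
    """Return (id_a, id_b, score) for every pair at or above the threshold.

    filters_a and filters_b map record IDs to bit arrays (hypothetical layout).
    """
    matches = []
    for id_a, bits_a in filters_a.items():
        for id_b, bits_b in filters_b.items():
            score = dice_similarity(bits_a, bits_b)
            if score >= threshold:
                matches.append((id_a, id_b, score))
    return matches


# Toy 4-bit filters: only the identical pair clears the 0.85 threshold.
print(link({"a1": 0b1111}, {"b1": 0b1111, "b2": 0b0001}))
```

The inputs here are only integers and record IDs, which is the point: the similarity score needs nothing but the bits.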

Security Levels

GoldenMatch offers three:

standard  → 512-bit filters,  20 hash functions, 2-grams
high      → 1024-bit filters, 30 hash functions, 2-grams  (default)
paranoid  → 2048-bit filters, 40 hash functions, 3-grams

Higher settings make the bloom filters harder to attack but slightly slower to compute. For most healthcare and government use cases, high is the right choice.
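One way to get a feel for these parameters is the standard bloom filter approximation for how full the bit array gets: after inserting n items with k hash functions into m bits, the expected fraction of bits set is 1 - (1 - 1/m)^(k·n). The sketch below applies that formula to the three levels listed above. It assumes one filter per value and all n-grams distinct, which may not match how GoldenMatch combines fields, and it says nothing by itself about resistance to any particular attack.

```python
# Parameters copied from the security level list above.
SECURITY_LEVELS = {
    "standard": {"bits": 512, "hashes": 20, "ngram": 2},
    "high": {"bits": 1024, "hashes": 30, "ngram": 2},
    "paranoid": {"bits": 2048, "hashes": 40, "ngram": 3},
}


def expected_fill(num_bits: int, num_hashes: int, num_grams: int) -> float:
    """Expected fraction of bits set after inserting num_grams items."""
    return 1.0 - (1.0 - 1.0 / num_bits) ** (num_hashes * num_grams)


def fill_for(text: str, level: str) -> float:
    """Expected fill for one value, assuming all of its n-grams are distinct."""
    cfg = SECURITY_LEVELS[level]
    num_grams = max(len(text) - cfg["ngram"] + 1, 0)
    return expected_fill(cfg["bits"], cfg["hashes"], num_grams)


for level in SECURITY_LEVELS:
    print(f"{level:9s} expected fill: {fill_for('john smith', level):.3f}")
```

Under these assumptions the filter gets sparser at each step up, because the bit array grows faster than the number of hash insertions.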

Want even stronger guarantees? Switch to the SMC protocol:

goldenmatch pprl link \
  --file-a hospital_a.csv \
  --file-b hospital_b.csv \
  --fields first_name,last_name,dob,zip \
  --protocol smc

SMC stands for Secure Multi-Party Computation. No party sees the other's bloom filters at all; only match/no-match bits are revealed. The similarity computation runs on secret shares instead of on the filters themselves.
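For readers new to secret sharing, here is a minimal sketch of the additive variant in Python. It is a generic illustration, not GoldenMatch's SMC protocol: the prime modulus, the function names, and the example values are assumptions. It shows the core idea that a value can be split into random-looking shares, and that parties can add shares locally to obtain shares of a sum. Computing the bit overlap that a Dice score needs also requires a multiplication protocol, which is omitted here.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical modulus; any prime larger than the values works


def share(value: int, n_parties: int = 2) -> list[int]:
    """Split value into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine shares; any strict subset of them looks uniformly random."""
    return sum(shares) % PRIME


# Each hospital shares the number of bits set in one of its filters
# (example values only).
shares_a = share(231)
shares_b = share(224)

# Each compute party adds the two shares it holds, without seeing 231 or 224.
sum_shares = [(sa + sb) % PRIME for sa, sb in zip(shares_a, shares_b)]

print(reconstruct(sum_shares))  # only the total is revealed
```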

Auto-Configuration

Don't know which fields to use or what bloom filter size to pick? Let GoldenMatch figure it out:

goldenmatch pprl link \
  --file-a data_a.csv \
  --file-b data_b.csv \
  --fields auto \
  --security high

Auto-config profiles each column — cardinality, average length, null rate, field type — and recommends:

Which fields are most useful for linkage (names + DOB are better than city + state)
Optimal bloom filter size for your data distribution
Hash count tuned to your field lengths
Whether to use 2-grams or 3-grams based on average name length
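The profiling step can be pictured with a short Python sketch. This is not the auto-config code, just an assumed minimal version of the per-column statistics mentioned above (null rate, cardinality, average length); the function name and the sample values are made up for illustration.

```python
def profile_column(values: list) -> dict:
    """Compute simple linkage-relevant statistics for one column."""
    non_null = [v for v in values if v not in (None, "")]
    total = len(values)
    return {
        "null_rate": 1 - len(non_null) / total if total else 0.0,
        "cardinality": len(set(non_null)),
        "avg_length": (
            sum(len(v) for v in non_null) / len(non_null) if non_null else 0.0
        ),
    }


# Made-up sample columns.
columns = {
    "last_name": ["smith", "nguyen", "garcia", "smith", "okafor", ""],
    "state": ["NY", "NY", "NY", "NJ", "NY", "NY"],
}

for name, values in columns.items():
    print(name, profile_column(values))
```

A column with many distinct values and few nulls, like a surname, separates people far better than a low-cardinality column like state, which is why names and DOB make stronger linkage fields than city and state.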

The Numbers

On the FEBRL4 synthetic benchmark (a standard PPRL evaluation dataset):

Mode	Precision	Recall	F1
Normal fuzzy matching (baseline)	56.5%	74.6%	64.3%
PPRL with auto-config	99.7%	86.1%	92.4%

92.4% F1 on encoded data. The PPRL mode actually outperforms the plaintext baseline on this dataset because the bloom filter encoding acts as a normalizer: it's insensitive to formatting differences that trip up raw string matching.
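The F1 column is the harmonic mean of precision and recall, so the table can be checked by hand. A small Python sketch (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline = f1_score(0.565, 0.746)
pprl = f1_score(0.997, 0.861)
print(f"baseline F1: {baseline:.1%}")  # 64.3%
print(f"PPRL F1:     {pprl:.1%}")      # 92.4%
```

Because F1 is a harmonic mean, the baseline's low precision drags its score down even though its recall is reasonable.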

What The Output Looks Like

The output is a standard cluster assignment file:

party,record_id,cluster_id
A,1042,C001
B,3891,C001
A,1043,C002
A,2817,C003
B,5102,C003

Party A's record 1042 and Party B's record 3891 are the same person (cluster C001). Neither party needs to know the other's details — just the cluster ID is enough to coordinate.
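Downstream, either party can pick out the cross-party matches by grouping rows on cluster_id. The sketch below uses Python's standard csv module on the sample rows shown above; the function name is made up, and the column names are taken from that sample.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """party,record_id,cluster_id
A,1042,C001
B,3891,C001
A,1043,C002
A,2817,C003
B,5102,C003
"""


def cross_party_matches(csv_text: str) -> dict:
    """Return clusters that contain records from more than one party."""
    clusters = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        clusters[row["cluster_id"]][row["party"]].append(row["record_id"])
    return {
        cluster_id: dict(parties)
        for cluster_id, parties in clusters.items()
        if len(parties) > 1
    }


for cluster_id, parties in cross_party_matches(SAMPLE).items():
    print(cluster_id, parties)
```

Cluster C002 drops out because it holds a single record from Party A, meaning that patient was not found at the other hospital.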

Real-World Use Cases

Healthcare: Link patient records across hospitals for care coordination. Avoid duplicate treatments and conflicting prescriptions without violating HIPAA.

Government: Deduplicate voter registration rolls across states. The National Voter Registration Act requires states to maintain their rolls, and PPRL lets them compare records without sending SSNs across state lines.

Finance: Anti-money laundering requires matching accounts across institutions. PPRL lets banks compare customer lists for fraud detection without sharing customer PII.

Research: Link survey respondents across longitudinal studies while preserving anonymity. Matching that fits an IRB-approved protocol, with no raw identifiers changing hands.

Marketing: Two companies want to find customer overlap for a partnership. Neither wants to hand over their customer list. PPRL finds the intersection without exposing either list.

How It Compares

Most PPRL implementations are research prototypes — Jupyter notebooks with custom bloom filter code, no CLI, no auto-configuration, no production deployment path.

Existing tools:

Anonlink (CSIRO) — Python library, requires manual setup, no auto-config
PPRL toolkit (University of Leipzig) — Java, research-focused
Splink — Has some privacy features, but primarily a standard linker

GoldenMatch bundles PPRL as one of 20 commands in a production CLI. Same tool that does your regular deduplication can also do privacy-preserving linkage. Same config format. Same output format.

The Important Part

The technical achievement isn't the 92% F1. It's that the entire workflow fits in one command.

No cryptography PhD required. No custom protocol implementation. No separate encryption step followed by a separate matching step followed by a separate decryption step.

goldenmatch pprl link -a file1.csv -b file2.csv --fields name,dob,zip

That's two organizations matching records without trusting each other with raw data. In one line.

GitHub: github.com/benzsevern/goldenmatch
Install: pip install goldenmatch
License: MIT
Docs: Full PPRL guide in the README

If you're dealing with cross-organization matching and compliance is blocking you, try the pprl link command. It might be the shortest path between "we can't share data" and "we found the matches."

DEV Community