benzsevern

Posted on • Originally published at bensevern.dev

GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark

How well does your deduplication tool handle profiles that are designed to fool it?

Amazon published BPID (Benchmark for Personal Identity Deduplication) at EMNLP 2024 — the first open-source benchmark specifically for PII matching. It includes 10,000 profile pairs where even GPT-4 and fine-tuned BERT models struggle to tell matches from non-matches.

We ran GoldenMatch against it. No training data, no fine-tuning. Just string similarity primitives, date parsing, and Vertex AI embeddings.

What Makes BPID Hard

Most entity resolution benchmarks (DBLP-ACM, Abt-Buy, Febrl) test whether your system can find similar records. BPID tests whether it can avoid matching records that look similar but aren't.

Each profile has five attributes:

| Field | Format | Challenge |
|---|---|---|
| fullname | Free text | Nicknames (Bill/William), gender variants (Daniel/Danielle), reordering (Smith John -> John Smith) |
| email | List of addresses | Shared domains, similar usernames across different people |
| phone | List of numbers | Country code variations, partial numbers, formatting noise |
| addr | List of addresses | Same street, different state; semantic variations (100th vs one hundredth) |
| dob | Free text date | Format variations (1990-11-14 vs 14 nov 1990), partial dates |

The dataset has 4,333 match pairs and 5,667 no-match pairs. The no-match pairs are intentionally adversarial — two different people named "Damien Skinner" and "Skinner Damien" sharing an email address and phone number, but with contradicting birthdates. A naive string similarity approach will confidently match them.

On top of that, ~18% of attribute values are missing. Some profiles have a single-letter name and no email. You get a fullname of "b" paired with "marshal jennifer bivens" — and they're labeled as a match.

The Published Baselines

The BPID paper benchmarked several methods:

| Method | Type | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest | Traditional (hand-crafted features) | 0.653 | 0.609 | 0.629 |
| Ditto | Pre-trained language model | 0.746 | 0.804 | 0.752 |
| Sudowoodo | Pre-trained language model (SOTA) | 0.774 | 0.802 | 0.788 |

The Random Forest uses hand-engineered string similarity features. Ditto and Sudowoodo are BERT-based models fine-tuned on labeled pairs. Even Claude 3 Sonnet and GPT-4 Turbo were tested — LLMs scored well but still made systematic errors on phone number digit comparison (tokenization struggles with exact digit counts).

Our Approach

GoldenMatch wasn't designed for BPID's pair classification format. It's a deduplication engine — you feed it a table of records and it finds clusters. So we adapted its scoring primitives for pairwise comparison and iterated through three configurations.

Config 1: Naive Weighted Scoring (0.665 F1)

Our first pass used GoldenMatch's field-level primitives (score_field, apply_transforms) with a list-aware scorer. BPID profiles have multi-valued fields (lists of emails, phones, addresses), so we score each element pair and take the maximum.

```python
import jellyfish
from rapidfuzz.distance import JaroWinkler
from rapidfuzz.fuzz import token_sort_ratio

def ensemble_name(a: str, b: str) -> float:
    """GoldenMatch ensemble: max(jaro_winkler, token_sort, soundex*0.8)"""
    jw = JaroWinkler.similarity(a, b)
    ts = token_sort_ratio(a, b) / 100.0
    sx = 1.0 if jellyfish.soundex(a) == jellyfish.soundex(b) else 0.0
    return max(jw, ts, sx * 0.8)
```

For identifier fields (email, phone), we check for exact overlap first — one shared email between two profiles is a strong match signal per BPID's annotation rules. The final score is a weighted average across all available fields.

This gave us 0.665 F1 — above the Random Forest baseline (0.629), but the score distribution told us why it wasn't higher:

```
Match pairs:    mean=0.828
No-match pairs: mean=0.715
Gap:            0.113
```

Only 0.11 separation. Some no-match pairs score a perfect 1.0.

Config 2: Optimized Classical Scoring (0.747 F1)

The breakthrough was proper DOB parsing. Our naive scorer compared raw digit strings — "14 nov 1953" and "1953-11-14" produce different digit sequences despite being the same date. We built a date parser that extracts (year, month, day) components from free-text dates:

```python
def parse_dob(dob: str) -> tuple[int | None, int | None, int | None]:
    """Parse free-text DOB into (year, month, day) components.

    Handles: '1953 11 09', '09 nov 1953', '19530911',
             'nov 1953', '09 2007', 'jul 18sat 1953'
    """
    # Extract month names, then parse remaining numbers
    # Try YYYYMMDD, then positional disambiguation
    ...
```

With parsed components, a contradicting birth year is a near-certain no-match signal — different people share names and addresses, but rarely share a birthdate. We weighted year contradictions at 2.5x.
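Expressed as a scoring rule, that looks roughly like this (a sketch; only the 2.5x weight comes from our config, the negative-score convention is illustrative):

```python
def dob_similarity(a: tuple, b: tuple, contradiction_weight: float = 2.5) -> float:
    """Compare parsed (year, month, day) tuples.

    A year contradiction returns a strong negative score; agreement on
    known components returns a fraction in [0, 1].
    """
    year_a, month_a, day_a = a
    year_b, month_b, day_b = b
    if year_a is not None and year_b is not None and year_a != year_b:
        return -contradiction_weight  # near-certain no-match signal
    known = [(x, y) for x, y in [(year_a, year_b), (month_a, month_b), (day_a, day_b)]
             if x is not None and y is not None]
    if not known:
        return 0.0  # missing data: no evidence either way
    return sum(x == y for x, y in known) / len(known)
```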

We also improved phone normalization (strip country codes, compare last 10 digits) and name scoring (first-name extraction to detect gender swaps like Daniel/Danielle).
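The phone normalization amounts to a few lines (a sketch of the stated rule: digits only, last 10 kept):

```python
import re

def normalize_phone(phone: str) -> str:
    """Strip everything but digits, then keep the last 10 (drops country codes)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]

def phones_match(a: str, b: str) -> bool:
    na, nb = normalize_phone(a), normalize_phone(b)
    return bool(na) and na == nb
```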

The result: 0.747 F1 — a +0.08 jump from DOB parsing alone.

```
Match pairs:    mean=0.899
No-match pairs: mean=0.715
Gap:            0.184
```

The score gap nearly doubled. Precision jumped from 0.541 to 0.655, eliminating ~1,200 false positives.

Config 3: Classical + Vertex AI Embeddings (0.750 F1)

We embedded all 20,000 profiles using Vertex AI's text-embedding-004 (768 dimensions) and computed cosine similarity for each pair. Embeddings alone scored 0.658 F1 — worse than classical scoring because the embedding gap was only 0.062 (adversarial profiles are semantically similar by design).

But blending 65% classical + 35% embedding produced 0.750 F1 — a small but real improvement. The embedding captures semantic relationships that string matching misses (Bill/William, abbreviated addresses) while the classical scorer provides the structural discrimination (DOB parsing, exact identifier overlap).
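The blend itself is simple (weights from the text; the cosine helper is standard, written out so the sketch is self-contained):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def hybrid_score(classical: float, emb_a: list[float], emb_b: list[float],
                 w_classical: float = 0.65) -> float:
    """65% classical score + 35% embedding cosine similarity."""
    return w_classical * classical + (1 - w_classical) * cosine(emb_a, emb_b)
```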

Results

| Method | Precision | Recall | F1 | Training Data | Time |
|---|---|---|---|---|---|
| Random Forest (BPID paper) | 0.653 | 0.609 | 0.629 | Yes | -- |
| GoldenMatch classical | 0.655 | 0.869 | 0.747 | No | 0.2s |
| GoldenMatch + embeddings | 0.672 | 0.849 | 0.750 | No | ~8min |
| Ditto (BPID paper) | 0.746 | 0.804 | 0.752 | Yes | -- |
| Sudowoodo (BPID paper) | 0.774 | 0.802 | 0.788 | Yes | -- |

GoldenMatch matches Ditto (0.750 vs 0.752) with zero training data. The gap to Sudowoodo (0.788) remains — fine-tuned BERT models that learn PII-specific representations still have an edge on adversarial data.

Note the precision-recall balance: GoldenMatch trades higher recall (0.849-0.869) for lower precision (0.655-0.672) compared to the PLMs. In production, this tradeoff is tunable via the threshold — at t=0.87, GoldenMatch hits 0.718 precision / 0.734 recall / 0.726 F1.
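Picking that threshold is a straightforward sweep over the score distribution (helper names are ours):

```python
def precision_recall_f1(scores: list[float], labels: list[bool], t: float):
    """Precision/recall/F1 when pairs scoring >= t are called matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Sweep candidate thresholds and pick the balance your use case needs:
# best = max((precision_recall_f1(scores, labels, t / 100) for t in range(50, 95)),
#            key=lambda prf: prf[2])
```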

What Made the Difference

DOB parsing was the single biggest lever

Going from raw digit comparison to parsed (year, month, day) components was worth +0.08 F1. A birth year contradiction is the strongest no-match signal in PII data — stronger than different names (people change names) or different addresses (people move).

Embeddings help, but not as much as you'd think

Vertex AI embeddings added only +0.003 F1 on top of the optimized classical scorer. The reason: BPID's adversarial pairs are designed to be semantically similar. "Daniel" and "Danielle" are close in embedding space. The embedding helps most on genuine matches with unusual formatting, but can't reject the traps.

Multi-valued fields need max-over-pairs scoring

BPID profiles have lists of emails, phones, and addresses. Concatenating them into a single string and running Jaro-Winkler produces poor results. Scoring each element pair and taking the maximum matches BPID's annotation rule: "one shared element = match for that attribute."

First-name extraction catches gender swaps

BPID includes deliberate negative name pairs: Daniel/Danielle, Jon/John, Mary/Mark. The ensemble scorer gives these high similarity (~0.85+). Extracting tokens and checking that at least one name token matches well across profiles catches many of these.
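A crude version of that check (our illustration, not necessarily GoldenMatch's exact logic): Daniel/Danielle score high on fuzzy ensembles but share no exact token, which lets us downweight the pair.

```python
def shares_exact_token(name_a: str, name_b: str) -> bool:
    """True when the two names share at least one exact token.

    'john smith' vs 'smith john' -> True ('smith' and 'john' both match);
    'daniel' vs 'danielle' -> False, despite high Jaro-Winkler similarity.
    """
    return bool(set(name_a.lower().split()) & set(name_b.lower().split()))
```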

LLM boost actually hurt performance

We sent 4,747 borderline pairs (hybrid score 0.66-0.86) to GPT-4.1-mini. The result surprised us: F1 dropped from 0.750 to 0.737. The LLM achieved only 60.7% accuracy on borderline pairs — barely better than random. It said "yes" to 2,646 of 4,747 pairs, creating more false positives than it eliminated.

Why? The same adversarial design that makes BPID hard for string matchers also tricks LLMs. Two profiles with the same name, similar emails, and overlapping phone numbers look like a match to a language model — it can't reliably detect that the birthdates contradict or that the phone numbers differ by exactly the last four digits. The BPID paper observed the same pattern: even GPT-4 Turbo and Claude 3 Sonnet make systematic errors on digit comparison because tokenization obscures exact digit counts.

The lesson: on adversarial PII data, structured feature engineering (parsing dates into components, normalizing phone numbers, checking first-name tokens) outperforms LLM reasoning. The LLM adds value on real-world data where the challenge is variety, not adversarial traps.

Running This Yourself

```shell
pip install goldenmatch
```

Download BPID from Zenodo (Apache 2.0 license, 58MB).


```python
import json

from goldenmatch.core.scorer import score_field
from goldenmatch.utils.transforms import apply_transforms

# Load BPID matching dataset
pairs, labels = [], []
with open("matching_dataset.jsonl") as f:
    for line in f:
        d = json.loads(line)
        pairs.append((d["profile1"], d["profile2"]))
        labels.append(d["match"] == "True")

# Score a pair with GoldenMatch primitives
name_a = apply_transforms("corrie arreola", ["lowercase", "strip"])
name_b = apply_transforms("arreola corrie", ["lowercase", "strip"])
score = score_field(name_a, name_b, "token_sort")  # 1.0 — handles reordering
```

The full benchmark scripts (naive, optimized, embedding, LLM boost) are in the bpid_bench directory.

Where GoldenMatch Fits

BPID is a pair classification benchmark — given two profiles, decide match or no-match. GoldenMatch is built for a different task: given a table of N records, find all duplicate clusters. The pair scoring approach here uses GoldenMatch's primitives outside their normal pipeline context.

For production PII deduplication, GoldenMatch's pipeline adds:

  • Blocking — reduces O(N^2) comparisons to manageable candidate sets
  • Clustering (Union-Find) — produces transitive groups, not just pairwise decisions
  • Golden rules — merges clusters into canonical records
  • LLM calibration — handles borderline pairs for ~$0.01
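For a sense of what the clustering step buys you, here is the classic Union-Find structure in miniature (a sketch, not GoldenMatch's internal code):

```python
class UnionFind:
    """Disjoint sets with path halving; pairwise matches become clusters."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind(4)
uf.union(0, 1)  # pair decision: record 0 matches record 1
uf.union(1, 2)  # pair decision: record 1 matches record 2
# Transitivity: 0 and 2 land in the same cluster without a direct comparison
```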

On structured data with standard blocking keys, GoldenMatch hits 97.2% F1 on DBLP-ACM and processes 401K records in under 30 seconds.

BPID tests a specific, adversarial corner: PII profiles with intentional near-miss traps. GoldenMatch matches Ditto's F1 without training data — and the classical scorer runs in 0.2 seconds.

Key Takeaways

  • GoldenMatch scores 0.750 F1 on BPID — matching Ditto (0.752), above Random Forest (0.629)
  • Zero training data — no labeled pairs, no fine-tuning, no GPU training
  • DOB parsing was the biggest win — proper date component extraction added +0.08 F1
  • Embeddings provide marginal gains — Vertex AI embeddings added +0.003 F1 over classical scoring
  • 0.2 seconds for classical scoring — 41,000+ pairs/sec on a laptop
  • The precision-recall tradeoff is tunable — adjust threshold for your use case

Try GoldenMatch on your own data: pip install goldenmatch or try it in the playground.

