benzsevern

Posted on • Originally published at bensevern.dev

GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark

How well does your deduplication tool handle profiles that are designed to fool it?

Amazon published BPID (Benchmark for Personal Identity Deduplication) at EMNLP 2024 — the first open-source benchmark specifically for PII matching. It includes 10,000 profile pairs where even GPT-4 and fine-tuned BERT models struggle to tell matches from non-matches.

We ran GoldenMatch against it. No training data, no fine-tuning. Just string similarity primitives, date parsing, and Vertex AI embeddings.

What Makes BPID Hard

Most entity resolution benchmarks (DBLP-ACM, Abt-Buy, Febrl) test whether your system can find similar records. BPID tests whether it can avoid matching records that look similar but aren't.

Each profile has five attributes:

| Field | Format | Challenge |
|---|---|---|
| fullname | Free text | Nicknames (Bill/William), gender variants (Daniel/Danielle), reordering (Smith John -> John Smith) |
| email | List of addresses | Shared domains, similar usernames across different people |
| phone | List of numbers | Country code variations, partial numbers, formatting noise |
| addr | List of addresses | Same street, different state; semantic variations (100th vs one hundredth) |
| dob | Free text date | Format variations (1990-11-14 vs 14 nov 1990), partial dates |

The dataset has 4,333 match pairs and 5,667 no-match pairs. The no-match pairs are intentionally adversarial — two different people named "Damien Skinner" and "Skinner Damien" sharing an email address and phone number, but with contradicting birthdates. A naive string similarity approach will confidently match them.

On top of that, ~18% of attribute values are missing. Some profiles have a single-letter name and no email. You get a fullname of "b" paired with "marshal jennifer bivens" — and they're labeled as a match.

The Published Baselines

The BPID paper benchmarked several methods:

| Method | Type | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest | Traditional (hand-crafted features) | 0.653 | 0.609 | 0.629 |
| Ditto | Pre-trained language model | 0.746 | 0.804 | 0.752 |
| Sudowoodo | Pre-trained language model (SOTA) | 0.774 | 0.802 | 0.788 |

The Random Forest uses hand-engineered string similarity features. Ditto and Sudowoodo are BERT-based models fine-tuned on labeled pairs. Even Claude 3 Sonnet and GPT-4 Turbo were tested — LLMs scored well but still made systematic errors on phone number digit comparison (tokenization struggles with exact digit counts).

Our Approach

GoldenMatch wasn't designed for BPID's pair classification format. It's a deduplication engine — you feed it a table of records and it finds clusters. So we adapted its scoring primitives for pairwise comparison and iterated through three configurations.

Config 1: Naive Weighted Scoring (0.665 F1)

Our first pass used GoldenMatch's field-level primitives (score_field, apply_transforms) with a list-aware scorer. BPID profiles have multi-valued fields (lists of emails, phones, addresses), so we score each element pair and take the maximum.

```python
import jellyfish
from rapidfuzz.distance import JaroWinkler
from rapidfuzz.fuzz import token_sort_ratio

def ensemble_name(a: str, b: str) -> float:
    """GoldenMatch ensemble: max(jaro_winkler, token_sort, soundex*0.8)"""
    jw = JaroWinkler.similarity(a, b)
    ts = token_sort_ratio(a, b) / 100.0
    sx = 1.0 if jellyfish.soundex(a) == jellyfish.soundex(b) else 0.0
    return max(jw, ts, sx * 0.8)
```

For identifier fields (email, phone), we check for exact overlap first — one shared email between two profiles is a strong match signal per BPID's annotation rules. The final score is a weighted average across all available fields.

This gave us 0.665 F1 — above the Random Forest baseline (0.629), but the score distribution told us why it wasn't higher:

```
Match pairs:    mean=0.828
No-match pairs: mean=0.715
Gap:            0.113
```

Only 0.11 separation. Some no-match pairs score a perfect 1.0.

Config 2: Optimized Classical Scoring (0.747 F1)

The breakthrough was proper DOB parsing. Our naive scorer compared raw digit strings — "14 nov 1953" and "1953-11-14" produce different digit sequences despite being the same date. We built a date parser that extracts (year, month, day) components from free-text dates:

```python
def parse_dob(dob: str) -> tuple[int | None, int | None, int | None]:
    """Parse free-text DOB into (year, month, day) components.

    Handles: '1953 11 09', '09 nov 1953', '19530911',
             'nov 1953', '09 2007', 'jul 18sat 1953'
    """
    # Extract month names, then parse remaining numbers
    # Try YYYYMMDD, then positional disambiguation
    ...
```

With parsed components, a contradicting birth year is a near-certain no-match signal — different people share names and addresses, but rarely share a birthdate. We weighted year contradictions at 2.5x.
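Expressed as a scoring rule, that looks roughly like this (a sketch; only the 2.5x weight comes from our config, the negative-score convention is illustrative):

```python
def dob_similarity(a: tuple, b: tuple, contradiction_weight: float = 2.5) -> float:
    """Compare parsed (year, month, day) tuples.

    A year contradiction returns a strong negative score; agreement on
    known components returns a fraction in [0, 1].
    """
    year_a, month_a, day_a = a
    year_b, month_b, day_b = b
    if year_a is not None and year_b is not None and year_a != year_b:
        return -contradiction_weight  # near-certain no-match signal
    known = [(x, y) for x, y in [(year_a, year_b), (month_a, month_b), (day_a, day_b)]
             if x is not None and y is not None]
    if not known:
        return 0.0  # missing data: no evidence either way
    return sum(x == y for x, y in known) / len(known)
```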

We also improved phone normalization (strip country codes, compare last 10 digits) and name scoring (first-name extraction to detect gender swaps like Daniel/Danielle).
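The phone normalization amounts to a few lines (a sketch of the stated rule: digits only, last 10 kept):

```python
import re

def normalize_phone(phone: str) -> str:
    """Strip everything but digits, then keep the last 10 (drops country codes)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]

def phones_match(a: str, b: str) -> bool:
    na, nb = normalize_phone(a), normalize_phone(b)
    return bool(na) and na == nb
```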

The result: 0.747 F1 — a +0.08 jump from DOB parsing alone.

```
Match pairs:    mean=0.899
No-match pairs: mean=0.715
Gap:            0.184
```

The score gap nearly doubled. Precision jumped from 0.541 to 0.655, eliminating ~1,200 false positives.

Config 3: Classical + Vertex AI Embeddings (0.750 F1)

We embedded all 20,000 profiles using Vertex AI's text-embedding-004 (768 dimensions) and computed cosine similarity for each pair. Embeddings alone scored 0.658 F1 — worse than classical scoring because the embedding gap was only 0.062 (adversarial profiles are semantically similar by design).

But blending 65% classical + 35% embedding produced 0.750 F1 — a small but real improvement. The embedding captures semantic relationships that string matching misses (Bill/William, abbreviated addresses) while the classical scorer provides the structural discrimination (DOB parsing, exact identifier overlap).
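The blend itself is simple (weights from the text; the cosine helper is standard, written out so the sketch is self-contained):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def hybrid_score(classical: float, emb_a: list[float], emb_b: list[float],
                 w_classical: float = 0.65) -> float:
    """65% classical score + 35% embedding cosine similarity."""
    return w_classical * classical + (1 - w_classical) * cosine(emb_a, emb_b)
```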

Results

| Method | Precision | Recall | F1 | Training Data | Time |
|---|---|---|---|---|---|
| Random Forest (BPID paper) | 0.653 | 0.609 | 0.629 | Yes | -- |
| GoldenMatch classical | 0.655 | 0.869 | 0.747 | No | 0.2s |
| GoldenMatch + embeddings | 0.672 | 0.849 | 0.750 | No | ~8min |
| Ditto (BPID paper) | 0.746 | 0.804 | 0.752 | Yes | -- |
| Sudowoodo (BPID paper) | 0.774 | 0.802 | 0.788 | Yes | -- |

GoldenMatch matches Ditto (0.750 vs 0.752) with zero training data. The gap to Sudowoodo (0.788) remains — fine-tuned BERT models that learn PII-specific representations still have an edge on adversarial data.

Note the precision-recall balance: GoldenMatch trades higher recall (0.849-0.869) for lower precision (0.655-0.672) compared to the PLMs. In production, this tradeoff is tunable via the threshold — at t=0.87, GoldenMatch hits 0.718 precision / 0.734 recall / 0.726 F1.
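Picking that threshold is a straightforward sweep over the score distribution (helper names are ours):

```python
def precision_recall_f1(scores: list[float], labels: list[bool], t: float):
    """Precision/recall/F1 when pairs scoring >= t are called matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Sweep candidate thresholds and pick the balance your use case needs:
# best = max((precision_recall_f1(scores, labels, t / 100) for t in range(50, 95)),
#            key=lambda prf: prf[2])
```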

What Made the Difference

DOB parsing was the single biggest lever

Going from raw digit comparison to parsed (year, month, day) components was worth +0.08 F1. A birth year contradiction is the strongest no-match signal in PII data — stronger than different names (people change names) or different addresses (people move).

Embeddings help, but not as much as you'd think

Vertex AI embeddings added only +0.003 F1 on top of the optimized classical scorer. The reason: BPID's adversarial pairs are designed to be semantically similar. "Daniel" and "Danielle" are close in embedding space. The embedding helps most on genuine matches with unusual formatting, but can't reject the traps.

Multi-valued fields need max-over-pairs scoring

BPID profiles have lists of emails, phones, and addresses. Concatenating them into a single string and running Jaro-Winkler produces poor results. Scoring each element pair and taking the maximum matches BPID's annotation rule: "one shared element = match for that attribute."

First-name extraction catches gender swaps

BPID includes deliberate negative name pairs: Daniel/Danielle, Jon/John, Mary/Mark. The ensemble scorer gives these high similarity (~0.85+). Extracting tokens and checking that at least one name token matches well across profiles catches many of these.
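A crude version of that check (our illustration, not necessarily GoldenMatch's exact logic): Daniel/Danielle score high on fuzzy ensembles but share no exact token, which lets us downweight the pair.

```python
def shares_exact_token(name_a: str, name_b: str) -> bool:
    """True when the two names share at least one exact token.

    'john smith' vs 'smith john' -> True ('smith' and 'john' both match);
    'daniel' vs 'danielle' -> False, despite high Jaro-Winkler similarity.
    """
    return bool(set(name_a.lower().split()) & set(name_b.lower().split()))
```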

LLM boost actually hurt performance

We sent 4,747 borderline pairs (hybrid score 0.66-0.86) to GPT-4.1-mini. The result surprised us: F1 dropped from 0.750 to 0.737. The LLM achieved only 60.7% accuracy on borderline pairs — barely better than random. It said "yes" to 2,646 of 4,747 pairs, creating more false positives than it eliminated.

Why? The same adversarial design that makes BPID hard for string matchers also tricks LLMs. Two profiles with the same name, similar emails, and overlapping phone numbers look like a match to a language model — it can't reliably detect that the birthdates contradict or that the phone numbers differ by exactly the last four digits. The BPID paper observed the same pattern: even GPT-4 Turbo and Claude 3 Sonnet make systematic errors on digit comparison because tokenization obscures exact digit counts.

The lesson: on adversarial PII data, structured feature engineering (parsing dates into components, normalizing phone numbers, checking first-name tokens) outperforms LLM reasoning. The LLM adds value on real-world data where the challenge is variety, not adversarial traps.

Running This Yourself

```shell
pip install goldenmatch
```

Download BPID from Zenodo (Apache 2.0 license, 58MB).


```python
import json

from goldenmatch.core.scorer import score_field
from goldenmatch.utils.transforms import apply_transforms

# Load BPID matching dataset
pairs, labels = [], []
with open("matching_dataset.jsonl") as f:
    for line in f:
        d = json.loads(line)
        pairs.append((d["profile1"], d["profile2"]))
        labels.append(d["match"] == "True")

# Score a pair with GoldenMatch primitives
name_a = apply_transforms("corrie arreola", ["lowercase", "strip"])
name_b = apply_transforms("arreola corrie", ["lowercase", "strip"])
score = score_field(name_a, name_b, "token_sort")  # 1.0 — handles reordering
```

The full benchmark scripts (naive, optimized, embedding, LLM boost) are in the bpid_bench directory.

Where GoldenMatch Fits

BPID is a pair classification benchmark — given two profiles, decide match or no-match. GoldenMatch is built for a different task: given a table of N records, find all duplicate clusters. The pair scoring approach here uses GoldenMatch's primitives outside their normal pipeline context.

For production PII deduplication, GoldenMatch's pipeline adds:

  • Blocking — reduces O(N^2) comparisons to manageable candidate sets
  • Clustering (Union-Find) — produces transitive groups, not just pairwise decisions
  • Golden rules — merges clusters into canonical records
  • LLM calibration — handles borderline pairs for ~$0.01
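For a sense of what the clustering step buys you, here is the classic Union-Find structure in miniature (a sketch, not GoldenMatch's internal code):

```python
class UnionFind:
    """Disjoint sets with path halving; pairwise matches become clusters."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind(4)
uf.union(0, 1)  # pair decision: record 0 matches record 1
uf.union(1, 2)  # pair decision: record 1 matches record 2
# Transitivity: 0 and 2 land in the same cluster without a direct comparison
```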

On structured data with standard blocking keys, GoldenMatch hits 97.2% F1 on DBLP-ACM and processes 401K records in under 30 seconds.

BPID tests a specific, adversarial corner: PII profiles with intentional near-miss traps. GoldenMatch matches Ditto's F1 without training data — and the classical scorer runs in 0.2 seconds.

Key Takeaways

  • GoldenMatch scores 0.750 F1 on BPID — matching Ditto (0.752), above Random Forest (0.629)
  • Zero training data — no labeled pairs, no fine-tuning, no GPU training
  • DOB parsing was the biggest win — proper date component extraction added +0.08 F1
  • Embeddings provide marginal gains — Vertex AI embeddings added +0.003 F1 over classical scoring
  • 0.2 seconds for classical scoring — 41,000+ pairs/sec on a laptop
  • The precision-recall tradeoff is tunable — adjust threshold for your use case

Try GoldenMatch on your own data: pip install goldenmatch or try it in the playground.

