Deduplicating records is a recurring problem in data engineering, and several challenges make it hard: scale, surface-level variation, context-dependent equivalence, and the need for world knowledge.
Let's look at an example. We wanted to build a database of AI researchers from academic lab websites. Scraping produced:
- Name variations: "Julie Kallini" vs "J. Kallini", "Moscato, Vincenzo" vs "Vincenzo Moscato"
- Typos: "Namoi Saphra" vs "Naomi Saphra", "Bryan Wiledr" vs "Bryan Wilder"
- Career changes: Same person listed at "AUTON Lab" and later at "AUTON Lab (Former)" with different emails
- GitHub handles: Sometimes the only reliable link between records, e.g. "A. Butoi" and "Alexandra Butoi" sharing `butoialexandra`
- Username-only names: Researchers who listed their GitHub handle ("smirchan", "VSAnimator") instead of their real name
We used a dataset of 200 researcher profiles scraped from academic lab websites. It was then manually reviewed to establish ground-truth clusters, which we used for evaluation.
The data covers name, position, organisation, email, university, and GitHub. GitHub handles are present in ~40% of rows and act as a high-precision but low-recall signal.
| row_id | name | position | organisation | email | university | github |
|---|---|---|---|---|---|---|
| 2 | A. Butoi | PhD Student | Rycolab | alexandra.butoi@personal.edu | ETH Zurich | butoialexandra |
| 8 | Alexandra Butoi | — | Ryoclab | — | — | butoialexandra |
| 43 | Namoi Saphra | Research Fellow | — | nsaphra@alumni | — | nsaphra |
| 47 | Naomi Saphra | — | Harvard / BU / EleutherAI | nsaphra@fas.harvard.edu | — | nsaphra |
| 18 | T. Gupta | PhD Student | AUTON Lab (Former) | — | Carnegie Mellon | tejus-gupta |
| 26 | Tejus Gupta | PhD Student | AUTON Lab | tejusg@cs.cmu.edu | Carnegie Mellon | tejus-gupta |
| 55 | smirchan | PhD Student | — | suvir@yahoo.com | Stanford University | smirchan |
| 155 | Suvir Mirchandani | PhD Student | Stanford CRFM | suvir@cs.stanford.edu | — | smirchan |
| 98 | Vincenzo Moscato | Full Professor | — | vincenzo.moscato@unina.it | — | — |
| 133 | Moscato, Vincenzo | Full Professor | University of Naples | vincenzo.moscato@yahoo.com | — | — |
A go-to approach is fuzzy string matching with libraries like fuzzywuzzy or rapidfuzz. These suffer from the threshold problem: set the threshold too low and you merge false positives; set it too high and you miss semantic duplicates like "A. Butoi" ↔ "Alexandra Butoi", which have low character overlap despite referring to the same person. The alternative is manual review, but with 200 rows requiring ~5 comparisons each, that's hours of tedious work.
We benchmarked fuzzy string matching as a baseline. It compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold using Union-Find clustering (a graph algorithm that efficiently merges items into equivalence classes by treating each match as an edge).
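To make the baseline concrete, here is a minimal sketch of that approach. It assumes rapidfuzz is installed and that matching is done on the `name` column only; it illustrates the method rather than reproducing the exact benchmark script.

```python
from itertools import combinations

import pandas as pd
from rapidfuzz import fuzz

df = pd.read_csv("researchers.csv")
names = df["name"].fillna("").tolist()
THRESHOLD = 75  # rapidfuzz scores run 0-100, so 75 corresponds to t=0.75

# Union-Find over row indices: each row starts as its own cluster.
parent = list(range(len(names)))

def find(i: int) -> int:
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps lookups cheap
        i = parent[i]
    return i

def union(i: int, j: int) -> None:
    parent[find(i)] = find(j)

# Every pair whose token-sorted similarity clears the threshold becomes an edge.
for i, j in combinations(range(len(names)), 2):
    if fuzz.token_sort_ratio(names[i], names[j]) >= THRESHOLD:
        union(i, j)

df["cluster"] = [find(i) for i in range(len(names))]
print(df.groupby("cluster").size().sort_values(ascending=False).head())
```

Because "A. Butoi" shares few characters with "Alexandra Butoi", any threshold high enough to avoid false merges tends to miss pairs like that, which is exactly the gap the results below show.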
| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) |
|---|---|---|
| Row accuracy | 86% | 82% |
| Cluster accuracy | 82% | 78% |
| Easy duplicates | 58% (7/12) | 17% (2/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) |
| Singletons | 90% (90/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s |
| Cost | $0 | $0 |
At t=0.75 it catches more duplicates but risks false merges. At t=0.90 it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".
Next we tried ChatGPT: we uploaded the CSV and asked it to deduplicate the records.
| Metric | ChatGPT |
|---|---|
| Row accuracy | 56% |
| Cluster accuracy | 45% (72/160) |
| Easy duplicates | 100% (12/12) |
| Hard duplicates | 70% (7/10) |
| Distractor accuracy | 25% (5/20) |
| Singletons | 33% (33/100) |
| Output rows | 72 |
| Data loss (over-merged) | 88 clusters |
ChatGPT over-merged:
- 88 clusters lost — unique people incorrectly merged into other records
- Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
- Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged
Now let's look at everyrow.io/dedupe. Instead of relying on string similarity thresholds, it uses LLMs to make contextual judgments about whether two records represent the same entity.
The system exposes a high-level deduplication operation that accepts a dataset and a natural-language equivalence definition. The equivalence relation can be as descriptive as needed and could also include examples.
```python
import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def main() -> None:
    input_df = pd.read_csv("researchers.csv")

    async with create_client() as client:
        async with create_session(client, name="Researcher Dedupe") as session:
            result = await dedupe(
                session=session,
                input=input_df,
                equivalence_relation=(
                    "Two rows are duplicates if they represent the same person "
                    "despite different email/organization (career changes). "
                    "Consider name variations like typos, nicknames (Robert/Bob), "
                    "and format differences (John Smith/J. Smith)."
                ),
            )

    result.data.to_csv("deduplicated.csv", index=False)


asyncio.run(main())
```
Accuracy was evaluated by comparing predicted equivalence classes against manually labeled ground truth. We report both row-level accuracy (whether a row is assigned to the correct cluster) and cluster-level accuracy (whether an entire entity cluster is correctly reconstructed).
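For reference, here is one reasonable way to compute those two metrics, assuming each row carries a ground-truth cluster label and a predicted cluster label. The exact scoring script behind the tables isn't shown, so treat this as an illustration.

```python
import pandas as pd

def members(labels: pd.Series) -> dict:
    """Map each row index to the frozenset of row indices sharing its label."""
    groups = labels.groupby(labels).groups  # cluster label -> Index of rows
    return {row: frozenset(rows) for rows in groups.values() for row in rows}

def score(truth: pd.Series, pred: pd.Series) -> tuple[float, float]:
    true_groups, pred_groups = members(truth), members(pred)

    # Row accuracy: a row counts as correct if it is grouped with exactly
    # the same set of rows as in the ground truth.
    row_acc = sum(true_groups[r] == pred_groups[r] for r in truth.index) / len(truth)

    # Cluster accuracy: fraction of ground-truth clusters reconstructed exactly.
    true_clusters = set(true_groups.values())
    pred_clusters = set(pred_groups.values())
    cluster_acc = len(true_clusters & pred_clusters) / len(true_clusters)
    return row_acc, cluster_acc
```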
| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) | ChatGPT | everyrow.io/dedupe |
|---|---|---|---|---|
| Row accuracy | 86% | 82% | 56% | 98% |
| Cluster accuracy | 82% | 78% | 45% | 97.5% |
| Easy duplicates | 58% (7/12) | 17% (2/12) | 100% (12/12) | 100% (12/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) | 70% (7/10) | 100% (10/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) | 25% (5/20) | 95% (19/20) |
| Singletons | 90% (90/100) | 100% (100/100) | 33% (33/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s | NA | 90s |
| Cost | $0 | $0 | NA | $0.42 |
A few examples from everyrow.io/dedupe, starting with some of the matches it found:
✓ Match: Name abbreviation + org typo
- Row 2: "A. Butoi" — Rycolab, ETH Zurich, `butoialexandra`
- Row 8: "Alexandra Butoi" — Ryoclab (typo), `butoialexandra`

✓ Match: Typo in first name
- Row 43: "Namoi Saphra" — `nsaphra`
- Row 47: "Naomi Saphra" — Harvard/BU/EleutherAI, `nsaphra`

✓ Match: Career transition
- Row 18: "T. Gupta" — AUTON Lab (Former), `tejus-gupta`
- Row 26: "Tejus Gupta" — AUTON Lab, `tejus-gupta`

✓ Match: Username-only name
- Row 55: "smirchan" — Stanford University, `smirchan`
- Row 155: "Suvir Mirchandani" — Stanford CRFM, `smirchan`

✗ Correctly identified as different people:
- Row 6: "Rohan Saha" — Alberta, `simpleParadox`
- Row 141: "Rohan Chandra" — UT Austin, `rohanchandra30`
And the errors it made:
⚠ Over-merged: Same institution
- "Sarah Ball" and "Wen (Lavine) Lai" — both at MCML, PhD students
⚠ Over-merged: Co-authors
- "Marwa Abdulhai" and "Tejus Gupta" — they co-authored a paper
⚠ Over-merged: Co-authors + username names
- "Suvir Mirchandani", "Igor Oliveira", and "Vishnu Sarukkai" — all three co-authored the same paper; username-only names made disambiguation harder
How does it work?
The system implements a multi-stage deduplication pipeline designed to reduce pairwise comparisons while preserving semantic recall.
Semantic Item Comparison: Each row is compared against others using an LLM that understands context—recognising that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organisation.
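everyrow's internal prompts and models aren't shown here, but a pairwise LLM judge can be sketched along these lines. The model name, prompt wording, and `same_entity` helper are all assumptions for illustration, not the product's actual implementation.

```python
from openai import OpenAI

client = OpenAI()

def same_entity(record_a: dict, record_b: dict, equivalence: str) -> bool:
    """Ask an LLM whether two records refer to the same entity."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "You judge whether two records refer to the same entity. "
                    f"Equivalence definition: {equivalence} "
                    "Answer with exactly MATCH or NO_MATCH."
                ),
            },
            {"role": "user", "content": f"Record A: {record_a}\nRecord B: {record_b}"},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("MATCH")
```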
Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.
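A sketch of that blocking step, assuming OpenAI embeddings over a simple text rendering of each row and a cosine-similarity cutoff (both placeholder choices, not everyrow's actual settings):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("researchers.csv").fillna("")

# A simple text rendering of each row; column names follow the table above.
texts = [
    f"{row['name']} | {row['organisation']} | {row['email']} | {row['github']}"
    for _, row in df.iterrows()
]

response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = np.array([item.embedding for item in response.data])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalise

CUTOFF = 0.6  # placeholder similarity cutoff
candidate_pairs = [
    (i, j)
    for i, j in combinations(range(len(texts)), 2)
    if float(vectors[i] @ vectors[j]) >= CUTOFF  # cosine similarity of unit vectors
]
```

Only the pairs that survive this filter are sent to the LLM judge, which keeps the number of expensive comparisons far below the full n² pairings.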
Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity.
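Continuing the sketches above (`candidate_pairs`, `same_entity`, and `df`), the connected-components step might look like this; `networkx` and the `EQUIVALENCE` string are illustrative choices:

```python
import networkx as nx

EQUIVALENCE = (
    "Two rows are duplicates if they represent the same person despite "
    "typos, abbreviations, or career changes."
)

# Keep only the candidate pairs the LLM judge confirms as matches.
matched_pairs = [
    (i, j)
    for i, j in candidate_pairs
    if same_entity(df.iloc[i].to_dict(), df.iloc[j].to_dict(), EQUIVALENCE)
]

graph = nx.Graph()
graph.add_nodes_from(range(len(df)))  # rows with no matches stay singletons
graph.add_edges_from(matched_pairs)   # if A~B and B~C, all three land together
clusters = [sorted(component) for component in nx.connected_components(graph)]
```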
Validation: Each multi-member cluster is re-evaluated to catch false positives—cases where the initial comparison was too aggressive. Validation is necessary to mitigate error propagation introduced by transitive closure in the association graph.
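A hedged sketch of such a validation pass, continuing the previous snippets: re-check every pair inside a multi-member cluster and split the cluster if any pair fails. The real logic is likely more nuanced than this all-or-nothing check.

```python
validated = []
for cluster in clusters:
    if len(cluster) == 1:
        validated.append(cluster)
        continue
    pairs = [(a, b) for idx, a in enumerate(cluster) for b in cluster[idx + 1:]]
    confirmed = all(
        same_entity(df.iloc[a].to_dict(), df.iloc[b].to_dict(), EQUIVALENCE)
        for a, b in pairs
    )
    if confirmed:
        validated.append(cluster)
    else:
        validated.extend([[row] for row in cluster])  # conservative: split apart
clusters = validated
```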
Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
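One simple way to pick a representative, continuing the sketches above: prefer the row with the most filled-in fields, breaking ties by the longer name. The actual selection heuristic isn't shown, so this is only an illustration.

```python
def completeness(row_idx: int) -> tuple[int, int]:
    """Prefer rows with more non-empty fields, then longer names."""
    row = df.iloc[row_idx]
    filled = int((row.astype(str).str.strip() != "").sum())
    return (filled, len(str(row["name"])))

representatives = [max(cluster, key=completeness) for cluster in clusters]
deduplicated = df.iloc[sorted(representatives)].reset_index(drop=True)
```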
The tradeoff: fuzzy matching is roughly 2000x faster and free, but trails by 12-16 percentage points in row accuracy. For datasets where false merges are costly, the LLM-based approach may be worth the additional runtime and cost.
- 98% row-level accuracy on a dataset with conflicting signals
- 90 seconds processing time and $0.42 LLM cost for 200 records
- 4 false positive clusters due to co-authorship signals and shared institution
This approach is most appropriate when:
- Semantic judgment is required: Name variations, abbreviations, nicknames
- Conflicting signals exist: Same person with different emails/organisations over time
- No single reliable key: Can't rely on email or ID alone
Use it yourself!
- Obtain an API key at everyrow.io
- Install the SDK: `uv pip install everyrow`, or visit the GitHub page: https://github.com/futuresearch/everyrow-sdk
- Define your equivalence relation in natural language
- Compare results against your ground-truth labels