Deduplicating records is a recurring problem in data engineering, and several challenges make it hard: scale, surface-level variation, context-dependent equivalence, and the need for world knowledge.
Let's look at an example. We wanted to build a database of AI researchers from academic lab websites. Scraping produced:
- Name variations: "Julie Kallini" vs "J. Kallini", "Moscato, Vincenzo" vs "Vincenzo Moscato"
- Typos: "Namoi Saphra" vs "Naomi Saphra", "Bryan Wiledr" vs "Bryan Wilder"
- Career changes: Same person listed at "AUTON Lab" and later at "AUTON Lab (Former)" with different emails
- GitHub handles: Sometimes the only reliable link between records, e.g. "A. Butoi" and "Alexandra Butoi" sharing `butoialexandra`
- Username-only names: Researchers who listed their GitHub handle ("smirchan", "VSAnimator") instead of their real name
We used a dataset of 200 researcher profiles scraped from academic lab websites. It was then manually reviewed to establish ground-truth clusters, which we used for evaluation.
The data covers name, position, organisation, email, university, and GitHub. GitHub handles are present in ~40% of rows and act as a high-precision but low-recall signal.
| row_id | name | position | organisation | email | university | github |
|---|---|---|---|---|---|---|
| 2 | A. Butoi | PhD Student | Rycolab | alexandra.butoi@personal.edu | ETH Zurich | butoialexandra |
| 8 | Alexandra Butoi | — | Ryoclab | — | — | butoialexandra |
| 43 | Namoi Saphra | Research Fellow | — | nsaphra@alumni | — | nsaphra |
| 47 | Naomi Saphra | — | Harvard / BU / EleutherAI | nsaphra@fas.harvard.edu | — | nsaphra |
| 18 | T. Gupta | PhD Student | AUTON Lab (Former) | — | Carnegie Mellon | tejus-gupta |
| 26 | Tejus Gupta | PhD Student | AUTON Lab | tejusg@cs.cmu.edu | Carnegie Mellon | tejus-gupta |
| 55 | smirchan | PhD Student | — | suvir@yahoo.com | Stanford University | smirchan |
| 155 | Suvir Mirchandani | PhD Student | Stanford CRFM | suvir@cs.stanford.edu | — | smirchan |
| 98 | Vincenzo Moscato | Full Professor | — | vincenzo.moscato@unina.it | — | — |
| 133 | Moscato, Vincenzo | Full Professor | University of Naples | vincenzo.moscato@yahoo.com | — | — |
A go-to approach is fuzzy string matching with libraries like fuzzywuzzy or rapidfuzz. These suffer from the threshold problem: set the threshold too low and you merge false positives; set it too high and you miss semantic duplicates like "A. Butoi" ↔ "Alexandra Butoi", which have low character overlap despite referring to the same person. The alternative is manual review, but with 200 rows requiring ~5 comparisons each, that's hours of tedious work.
We benchmarked fuzzy string matching as a baseline. It compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold using Union-Find clustering (a graph algorithm that efficiently merges items into equivalence classes by treating each match as an edge).
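To make the baseline concrete, here is a minimal sketch of that approach. It assumes rapidfuzz is installed and that matching is done on the `name` column only; it illustrates the method rather than reproducing the exact benchmark script.

```python
from itertools import combinations

import pandas as pd
from rapidfuzz import fuzz

df = pd.read_csv("researchers.csv")
names = df["name"].fillna("").tolist()
THRESHOLD = 75  # rapidfuzz scores run 0-100, so 75 corresponds to t=0.75

# Union-Find over row indices: each row starts as its own cluster.
parent = list(range(len(names)))

def find(i: int) -> int:
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps lookups cheap
        i = parent[i]
    return i

def union(i: int, j: int) -> None:
    parent[find(i)] = find(j)

# Every pair whose token-sorted similarity clears the threshold becomes an edge.
for i, j in combinations(range(len(names)), 2):
    if fuzz.token_sort_ratio(names[i], names[j]) >= THRESHOLD:
        union(i, j)

df["cluster"] = [find(i) for i in range(len(names))]
print(df.groupby("cluster").size().sort_values(ascending=False).head())
```

Because "A. Butoi" shares few characters with "Alexandra Butoi", any threshold high enough to avoid false merges tends to miss pairs like that, which is exactly the gap the results below show.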
| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) |
|---|---|---|
| Row accuracy | 86% | 82% |
| Cluster accuracy | 82% | 78% |
| Easy duplicates | 58% (7/12) | 17% (2/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) |
| Singletons | 90% (90/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s |
| Cost | $0 | $0 |
At t=0.75 it catches more duplicates but risks false merges. At t=0.90 it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".
Next we tried ChatGPT: we uploaded the CSV and asked it to deduplicate the records.
| Metric | ChatGPT |
|---|---|
| Row accuracy | 56% |
| Cluster accuracy | 45% (72/160) |
| Easy duplicates | 100% (12/12) |
| Hard duplicates | 70% (7/10) |
| Distractor accuracy | 25% (5/20) |
| Singletons | 33% (33/100) |
| Output rows | 72 |
| Data loss (over-merged) | 88 clusters |
ChatGPT over-merged:
- 88 clusters lost — unique people incorrectly merged into other records
- Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
- Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged
Now let's look at everyrow.io/dedupe. Instead of relying on string similarity thresholds, it uses LLMs to make contextual judgments about whether two records represent the same entity.
The system exposes a high-level deduplication operation that accepts a dataset and a natural-language equivalence definition. The equivalence relation can be as descriptive as needed and could also include examples.
```python
import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def main() -> None:
    input_df = pd.read_csv("researchers.csv")

    async with create_client() as client:
        async with create_session(client, name="Researcher Dedupe") as session:
            result = await dedupe(
                session=session,
                input=input_df,
                equivalence_relation=(
                    "Two rows are duplicates if they represent the same person "
                    "despite different email/organization (career changes). "
                    "Consider name variations like typos, nicknames (Robert/Bob), "
                    "and format differences (John Smith/J. Smith)."
                ),
            )

    result.data.to_csv("deduplicated.csv", index=False)


asyncio.run(main())
```
Accuracy was evaluated by comparing predicted equivalence classes against manually labeled ground truth. We report both row-level accuracy (whether a row is assigned to the correct cluster) and cluster-level accuracy (whether an entire entity cluster is correctly reconstructed).
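For reference, here is one reasonable way to compute those two metrics, assuming each row carries a ground-truth cluster label and a predicted cluster label. The exact scoring script behind the tables isn't shown, so treat this as an illustration.

```python
import pandas as pd

def members(labels: pd.Series) -> dict:
    """Map each row index to the frozenset of row indices sharing its label."""
    groups = labels.groupby(labels).groups  # cluster label -> Index of rows
    return {row: frozenset(rows) for rows in groups.values() for row in rows}

def score(truth: pd.Series, pred: pd.Series) -> tuple[float, float]:
    true_groups, pred_groups = members(truth), members(pred)

    # Row accuracy: a row counts as correct if it is grouped with exactly
    # the same set of rows as in the ground truth.
    row_acc = sum(true_groups[r] == pred_groups[r] for r in truth.index) / len(truth)

    # Cluster accuracy: fraction of ground-truth clusters reconstructed exactly.
    true_clusters = set(true_groups.values())
    pred_clusters = set(pred_groups.values())
    cluster_acc = len(true_clusters & pred_clusters) / len(true_clusters)
    return row_acc, cluster_acc
```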
| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) | ChatGPT | everyrow.io/dedupe |
|---|---|---|---|---|
| Row accuracy | 86% | 82% | 56% | 98% |
| Cluster accuracy | 82% | 78% | 45% | 97.5% |
| Easy duplicates | 58% (7/12) | 17% (2/12) | 100% (12/12) | 100% (12/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) | 70% (7/10) | 100% (10/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) | 25% (5/20) | 95% (19/20) |
| Singletons | 90% (90/100) | 100% (100/100) | 33% (33/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s | NA | 90s |
| Cost | $0 | $0 | NA | $0.42 |
A few examples from everyrow.io/dedupe, starting with some of the matches it found:
✓ Match: Name abbreviation + org typo
- Row 2: "A. Butoi" — Rycolab, ETH Zurich, `butoialexandra`
- Row 8: "Alexandra Butoi" — Ryoclab (typo), `butoialexandra`

✓ Match: Typo in first name
- Row 43: "Namoi Saphra" — `nsaphra`
- Row 47: "Naomi Saphra" — Harvard/BU/EleutherAI, `nsaphra`

✓ Match: Career transition
- Row 18: "T. Gupta" — AUTON Lab (Former), `tejus-gupta`
- Row 26: "Tejus Gupta" — AUTON Lab, `tejus-gupta`

✓ Match: Username-only name
- Row 55: "smirchan" — Stanford University, `smirchan`
- Row 155: "Suvir Mirchandani" — Stanford CRFM, `smirchan`

✗ Correctly identified as different people:
- Row 6: "Rohan Saha" — Alberta, `simpleParadox`
- Row 141: "Rohan Chandra" — UT Austin, `rohanchandra30`
And the errors it made:
⚠ Over-merged: Same institution
- "Sarah Ball" and "Wen (Lavine) Lai" — both at MCML, PhD students
⚠ Over-merged: Co-authors
- "Marwa Abdulhai" and "Tejus Gupta" — they co-authored a paper
⚠ Over-merged: Co-authors + username names
- "Suvir Mirchandani", "Igor Oliveira", and "Vishnu Sarukkai" — all three co-authored the same paper; username-only names made disambiguation harder
How does it work?
The system implements a multi-stage deduplication pipeline designed to reduce pairwise comparisons while preserving semantic recall.
Semantic Item Comparison: Each row is compared against others using an LLM that understands context—recognising that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organisation.
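everyrow's internal prompts and models aren't shown here, but a pairwise LLM judge can be sketched along these lines. The model name, prompt wording, and `same_entity` helper are all assumptions for illustration, not the product's actual implementation.

```python
from openai import OpenAI

client = OpenAI()

def same_entity(record_a: dict, record_b: dict, equivalence: str) -> bool:
    """Ask an LLM whether two records refer to the same entity."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "You judge whether two records refer to the same entity. "
                    f"Equivalence definition: {equivalence} "
                    "Answer with exactly MATCH or NO_MATCH."
                ),
            },
            {"role": "user", "content": f"Record A: {record_a}\nRecord B: {record_b}"},
        ],
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("MATCH")
```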
Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.
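A sketch of that blocking step, assuming OpenAI embeddings over a simple text rendering of each row and a cosine-similarity cutoff (both placeholder choices, not everyrow's actual settings):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("researchers.csv").fillna("")

# A simple text rendering of each row; column names follow the table above.
texts = [
    f"{row['name']} | {row['organisation']} | {row['email']} | {row['github']}"
    for _, row in df.iterrows()
]

response = client.embeddings.create(model="text-embedding-3-small", input=texts)
vectors = np.array([item.embedding for item in response.data])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalise

CUTOFF = 0.6  # placeholder similarity cutoff
candidate_pairs = [
    (i, j)
    for i, j in combinations(range(len(texts)), 2)
    if float(vectors[i] @ vectors[j]) >= CUTOFF  # cosine similarity of unit vectors
]
```

Only the pairs that survive this filter are sent to the LLM judge, which keeps the number of expensive comparisons far below the full n² pairings.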
Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity.
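Continuing the sketches above (`candidate_pairs`, `same_entity`, and `df`), the connected-components step might look like this; `networkx` and the `EQUIVALENCE` string are illustrative choices:

```python
import networkx as nx

EQUIVALENCE = (
    "Two rows are duplicates if they represent the same person despite "
    "typos, abbreviations, or career changes."
)

# Keep only the candidate pairs the LLM judge confirms as matches.
matched_pairs = [
    (i, j)
    for i, j in candidate_pairs
    if same_entity(df.iloc[i].to_dict(), df.iloc[j].to_dict(), EQUIVALENCE)
]

graph = nx.Graph()
graph.add_nodes_from(range(len(df)))  # rows with no matches stay singletons
graph.add_edges_from(matched_pairs)   # if A~B and B~C, all three land together
clusters = [sorted(component) for component in nx.connected_components(graph)]
```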
Validation: Each multi-member cluster is re-evaluated to catch false positives—cases where the initial comparison was too aggressive. Validation is necessary to mitigate error propagation introduced by transitive closure in the association graph.
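A hedged sketch of such a validation pass, continuing the previous snippets: re-check every pair inside a multi-member cluster and split the cluster if any pair fails. The real logic is likely more nuanced than this all-or-nothing check.

```python
validated = []
for cluster in clusters:
    if len(cluster) == 1:
        validated.append(cluster)
        continue
    pairs = [(a, b) for idx, a in enumerate(cluster) for b in cluster[idx + 1:]]
    confirmed = all(
        same_entity(df.iloc[a].to_dict(), df.iloc[b].to_dict(), EQUIVALENCE)
        for a, b in pairs
    )
    if confirmed:
        validated.append(cluster)
    else:
        validated.extend([[row] for row in cluster])  # conservative: split apart
clusters = validated
```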
Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
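One simple way to pick a representative, continuing the sketches above: prefer the row with the most filled-in fields, breaking ties by the longer name. The actual selection heuristic isn't shown, so this is only an illustration.

```python
def completeness(row_idx: int) -> tuple[int, int]:
    """Prefer rows with more non-empty fields, then longer names."""
    row = df.iloc[row_idx]
    filled = int((row.astype(str).str.strip() != "").sum())
    return (filled, len(str(row["name"])))

representatives = [max(cluster, key=completeness) for cluster in clusters]
deduplicated = df.iloc[sorted(representatives)].reset_index(drop=True)
```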
The tradeoff: fuzzy matching is roughly 2000x faster and free, but trails by 12-16 percentage points in row accuracy. For datasets where false merges are costly, the LLM-based approach may be worth the additional runtime and cost.
- 98% row-level accuracy on a dataset with conflicting signals
- 90 seconds processing time and $0.42 LLM cost for 200 records
- 4 false positive clusters due to co-authorship signals and shared institution
This approach is most appropriate when:
- Semantic judgment is required: Name variations, abbreviations, nicknames
- Conflicting signals exist: Same person with different emails/organisations over time
- No single reliable key: Can't rely on email or ID alone
Use it yourself!
- Obtain an API key at everyrow.io
- Install the SDK: `uv pip install everyrow`, or visit the GitHub page: https://github.com/futuresearch/everyrow-sdk
- Define your equivalence relation in natural language
- Compare results against your ground-truth labels