Rafael Poyiadzi

Introducing `everyrow.io/dedupe`: An LLM-based approach to semantic deduplication

Deduplicating records is a recurring problem in data engineering, and several challenges make it difficult: scale, surface-level variation, context-dependent equivalence, and the need for world knowledge.

Let's look at an example. We wanted to build a database of AI researchers from academic lab websites. Scraping produced:

  • Name variations: "Julie Kallini" vs "J. Kallini", "Moscato, Vincenzo" vs "Vincenzo Moscato"
  • Typos: "Namoi Saphra" vs "Naomi Saphra", "Bryan Wiledr" vs "Bryan Wilder"
  • Career changes: Same person listed at "AUTON Lab" and later at "AUTON Lab (Former)" with different emails
  • GitHub handles: Sometimes the only reliable link between records, as with "A. Butoi" and "Alexandra Butoi" sharing butoialexandra
  • Username-only names: Researchers who listed their GitHub handle ("smirchan", "VSAnimator") instead of their real name

Our dataset consists of 200 researcher profiles scraped from academic lab websites, manually reviewed to establish the ground-truth clusters we used for evaluation.

The data covers name, position, organisation, email, university, and GitHub. GitHub handles are present in ~40% of rows and act as a high-precision but low-recall signal.
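To make the "high-precision but low-recall" point concrete, here is a minimal sketch (with illustrative rows, not the real dataset) that groups records by GitHub handle. Any rows sharing a handle are near-certain duplicates, but most rows have no handle at all:

```python
from collections import defaultdict

# Illustrative rows; field names mirror the dataset described above.
rows = [
    {"row_id": 2, "name": "A. Butoi", "github": "butoialexandra"},
    {"row_id": 8, "name": "Alexandra Butoi", "github": "butoialexandra"},
    {"row_id": 98, "name": "Vincenzo Moscato", "github": None},  # ~60% of rows lack a handle
]

by_handle = defaultdict(list)
for row in rows:
    if row["github"]:  # recall is capped by handle coverage
        by_handle[row["github"]].append(row["row_id"])

# Rows sharing a handle are near-certain duplicates (high precision).
candidate_pairs = {h: ids for h, ids in by_handle.items() if len(ids) > 1}
print(candidate_pairs)  # {'butoialexandra': [2, 8]}
```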

| row_id | name | position | organisation | email | university | github |
|---|---|---|---|---|---|---|
| 2 | A. Butoi | PhD Student | Rycolab | alexandra.butoi@personal.edu | ETH Zurich | butoialexandra |
| 8 | Alexandra Butoi | | Ryoclab | | | butoialexandra |
| 43 | Namoi Saphra | Research Fellow | | nsaphra@alumni | - | nsaphra |
| 47 | Naomi Saphra | | Harvard / BU / EleutherAI | nsaphra@fas.harvard.edu | | nsaphra |
| 18 | T. Gupta | PhD Student | AUTON Lab (Former) | | Carnegie Mellon | tejus-gupta |
| 26 | Tejus Gupta | PhD Student | AUTON Lab | tejusg@cs.cmu.edu | Carnegie Mellon | tejus-gupta |
| 55 | smirchan | PhD Student | | suvir@yahoo.com | Stanford University | smirchan |
| 155 | Suvir Mirchandani | PhD Student | Stanford CRFM | suvir@cs.stanford.edu | | smirchan |
| 98 | Vincenzo Moscato | Full Professor | | vincenzo.moscato@unina.it | | |
| 133 | Moscato, Vincenzo | Full Professor | University of Naples | vincenzo.moscato@yahoo.com | | |

A go-to approach is fuzzy string matching using libraries like fuzzywuzzy or rapidfuzz. However, these suffer from the threshold problem: set it too low and you catch false positives; set it too high and you miss semantic duplicates like "A. Butoi" ↔ "Alexandra Butoi" which have low character overlap despite being the same person. The alternative is manual review, but with 200 rows requiring ~5 comparisons each, that's hours of tedious work.

We benchmarked fuzzy string matching as a baseline. It compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold using Union-Find clustering (a graph algorithm that efficiently merges items into equivalence classes by treating each match as an edge).
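The baseline can be sketched in a few lines of stdlib Python (using `difflib` as a stand-in for rapidfuzz's `token_sort_ratio`). Note how at t=0.90 it merges a reordered name but misses the abbreviated one, exactly the failure mode discussed above:

```python
from difflib import SequenceMatcher

def token_sort_sim(a: str, b: str) -> float:
    # Sort tokens before comparing so word order doesn't matter.
    norm = lambda s: " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# Union-Find: each match is an edge; find/union merge equivalence classes.
def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def union(parent, i, j):
    parent[find(parent, i)] = find(parent, j)

names = ["Tejus Gupta", "Gupta, Tejus", "T. Gupta", "Rohan Saha"]
threshold = 0.9
parent = list(range(len(names)))
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if token_sort_sim(names[i], names[j]) >= threshold:
            union(parent, i, j)

clusters = {}
for i in range(len(names)):
    clusters.setdefault(find(parent, i), []).append(names[i])
print(list(clusters.values()))
# merges "Tejus Gupta" / "Gupta, Tejus" but leaves "T. Gupta" as its own cluster
```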

| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) |
|---|---|---|
| Row accuracy | 86% | 82% |
| Cluster accuracy | 82% | 78% |
| Easy duplicates | 58% (7/12) | 17% (2/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) |
| Singletons | 90% (90/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s |
| Cost | $0 | $0 |

At t=0.75 it catches more duplicates but risks false merges. At t=0.90 it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".

We next tried ChatGPT: we uploaded the CSV and asked it to deduplicate.

| Metric | ChatGPT |
|---|---|
| Row accuracy | 56% |
| Cluster accuracy | 45% (72/160) |
| Easy duplicates | 100% (12/12) |
| Hard duplicates | 70% (7/10) |
| Distractor accuracy | 25% (5/20) |
| Singletons | 33% (33/100) |
| Output rows | 72 |
| Data loss (over-merged) | 88 clusters |

ChatGPT over-merged:

  • 88 clusters lost — unique people incorrectly merged into other records
  • Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
  • Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged

Now let's look at everyrow.io/dedupe. Instead of relying on string similarity thresholds, it uses LLMs to make contextual judgments about whether two records represent the same entity.

The system exposes a high-level deduplication operation that accepts a dataset and a natural-language equivalence definition. The equivalence relation can be as descriptive as needed and could also include examples.

from everyrow import create_client, create_session
from everyrow.ops import dedupe
import pandas as pd

input_df = pd.read_csv("researchers.csv")

async with create_client() as client:
    async with create_session(client, name="Researcher Dedupe") as session:
        result = await dedupe(
            session=session,
            input=input_df,
            equivalence_relation=(
                "Two rows are duplicates if they represent the same person "
                "despite different email/organization (career changes). "
                "Consider name variations like typos, nicknames (Robert/Bob), "
                "and format differences (John Smith/J. Smith)."
            ),
        )
        result.data.to_csv("deduplicated.csv", index=False)

Accuracy was evaluated by comparing predicted equivalence classes against manually labeled ground truth. We report both row-level accuracy (whether a row is assigned to the correct cluster) and cluster-level accuracy (whether an entire entity cluster is correctly reconstructed).
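The post doesn't spell out the exact formulas, but one plausible formulation of these two metrics is: a row counts as correct if its predicted cluster contains exactly the same members as its ground-truth cluster, and a cluster counts as correct if it is reconstructed exactly. A sketch under that assumption:

```python
def clusters_by_row(clusters):
    # Map each row id to the (frozen) set of members of its cluster.
    return {row: frozenset(c) for c in clusters for row in c}

def evaluate(predicted, truth):
    pred_of, true_of = clusters_by_row(predicted), clusters_by_row(truth)
    # Row-level: the row's predicted cluster matches its true cluster exactly.
    row_acc = sum(pred_of[r] == true_of[r] for r in true_of) / len(true_of)
    # Cluster-level: the true cluster appears verbatim among predictions.
    pred_sets = {frozenset(p) for p in predicted}
    cluster_acc = sum(frozenset(c) in pred_sets for c in truth) / len(truth)
    return row_acc, cluster_acc

truth     = [{2, 8}, {43, 47}, {98, 133}, {55}]
predicted = [{2, 8}, {43}, {47}, {98, 133}, {55}]  # missed one merge
row_acc, cluster_acc = evaluate(predicted, truth)
print(row_acc, cluster_acc)  # 5/7 rows correct, 3/4 clusters correct
```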

| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) | ChatGPT | everyrow.io/dedupe |
|---|---|---|---|---|
| Row accuracy | 86% | 82% | 56% | 98% |
| Cluster accuracy | 82% | 78% | 45% | 97.5% |
| Easy duplicates | 58% (7/12) | 17% (2/12) | 100% (12/12) | 100% (12/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) | 70% (7/10) | 100% (10/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) | 25% (5/20) | 95% (19/20) |
| Singletons | 90% (90/100) | 100% (100/100) | 33% (33/100) | 100% (100/100) |
| Processing time | 0.04s | 0.04s | NA | 90s |
| Cost | $0 | $0 | NA | $0.42 |

A few examples from everyrow.io/dedupe. Starting with some found matches:

✓ Match: Name abbreviation + org typo

  • Row 2: "A. Butoi" — Rycolab, ETH Zurich, butoialexandra
  • Row 8: "Alexandra Butoi" — Ryoclab (typo), butoialexandra

✓ Match: Typo in first name

  • Row 43: "Namoi Saphra" — nsaphra
  • Row 47: "Naomi Saphra" — Harvard/BU/EleutherAI, nsaphra

✓ Match: Career transition

  • Row 18: "T. Gupta" — AUTON Lab (Former), tejus-gupta
  • Row 26: "Tejus Gupta" — AUTON Lab, tejus-gupta

✓ Match: Username-only name

  • Row 55: "smirchan" — Stanford University, smirchan
  • Row 155: "Suvir Mirchandani" — Stanford CRFM, smirchan

✗ Correctly identified as different people:

  • Row 6: "Rohan Saha" — Alberta, simpleParadox
  • Row 141: "Rohan Chandra" — UT Austin, rohanchandra30

And the errors made:

⚠ Over-merged: Same institution

  • "Sarah Ball" and "Wen (Lavine) Lai" — both at MCML, PhD students

⚠ Over-merged: Co-authors

  • "Marwa Abdulhai" and "Tejus Gupta" — they co-authored a paper

⚠ Over-merged: Co-authors + username names

  • "Suvir Mirchandani", "Igor Oliveira", and "Vishnu Sarukkai" — all three co-authored the same paper; username-only names made disambiguation harder

How does it work?

The system implements a multi-stage deduplication pipeline designed to reduce pairwise comparisons while preserving semantic recall.

  1. Semantic Item Comparison: Each row is compared against others using an LLM that understands context—recognising that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organisation.

  2. Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.

  3. Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity.

  4. Validation: Each multi-member cluster is re-evaluated to catch false positives—cases where the initial comparison was too aggressive. Validation is necessary to mitigate error propagation introduced by transitive closure in the association graph.

  5. Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
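The five stages above can be sketched end-to-end in plain Python. The two interesting parts are stubbed: `same_entity` stands in for the LLM comparison and `block_key` for embedding-based blocking, since those are exactly the pieces everyrow replaces with models:

```python
from itertools import combinations

rows = {
    2: "A. Butoi", 8: "Alexandra Butoi",
    6: "Rohan Saha", 141: "Rohan Chandra",
}

# Stage 2 stand-in: the real system blocks by embedding similarity;
# here we block on the surname token so only plausible pairs are compared.
def block_key(name: str) -> str:
    return name.split()[-1].lower()

# Stage 1 stand-in: the real system asks an LLM for a contextual judgment;
# here, same surname plus matching first initial.
def same_entity(a: str, b: str) -> bool:
    return (a.split()[-1].lower() == b.split()[-1].lower()
            and a[0].lower() == b[0].lower())

blocks = {}
for rid, name in rows.items():
    blocks.setdefault(block_key(name), []).append(rid)

# Stage 3: pairwise matches become edges; connected components are clusters.
adj = {rid: set() for rid in rows}
for ids in blocks.values():
    for i, j in combinations(ids, 2):
        if same_entity(rows[i], rows[j]):
            adj[i].add(j); adj[j].add(i)

clusters, seen = [], set()
for rid in rows:
    if rid in seen:
        continue
    comp, stack = set(), [rid]
    while stack:
        cur = stack.pop()
        if cur not in comp:
            comp.add(cur); stack.extend(adj[cur])
    seen |= comp
    clusters.append(comp)

# Stage 4 (validation) would re-check every multi-member cluster here.
# Stage 5: keep the most complete record as the canonical representative.
canonical = [max(c, key=lambda rid: len(rows[rid])) for c in clusters]
print(sorted(rows[rid] for rid in canonical))
# ['Alexandra Butoi', 'Rohan Chandra', 'Rohan Saha']
```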

The tradeoff: fuzzy matching is 2000x faster and free, but has a 12-16% accuracy gap. For datasets where false merges are costly, the LLM-based approach may be worth the additional runtime and cost.

  • 98% row-level accuracy on a dataset with conflicting signals
  • 90 seconds processing time and $0.42 LLM cost for 200 records
  • 4 false positive clusters due to co-authorship signals and shared institution

This approach is most appropriate when:

  • Semantic judgment is required: Name variations, abbreviations, nicknames
  • Conflicting signals exist: Same person with different emails/organisations over time
  • No single reliable key: Can't rely on email or ID alone

Use it yourself!

  1. Obtain an API key at everyrow.io
  2. Install the SDK with uv pip install everyrow, or visit the GitHub page: https://github.com/futuresearch/everyrow-sdk
  3. Define your equivalence relation in natural language
  4. Compare results against your ground-truth labels
