benzsevern

Posted on • Originally published at bensevern.dev

AI-Powered Deduplication: How LLMs Supercharge the Golden Suite

You have 52,288 school records from the UK government's Get Information About Schools register. Half are open, half are closed. Schools that converted to academies appear twice — once as the closed LA school and once as the new academy, same postcode, slightly different name. "Kingsgate Junior School" (Closed) becomes "Kingsgate Primary School" (Open). "Primrose Hill Infant School" (Closed) becomes "Primrose Hill School" (Open).

GoldenMatch's fuzzy matcher catches these. But GoldenCheck also surfaces 820 data quality findings on this dataset, some of which are noise. And borderline pairs — "The Hall School" appearing at three different postcodes — need human-like judgment to sort out.

That's where LLMs come in.

What LLM Boost Does

Three Golden Suite tools have optional LLM integration. Each solves a different problem:

| Tool | Feature | What It Does | Cost (52K rows) |
| --- | --- | --- | --- |
| GoldenCheck | scan_file_with_llm() | Catches data quality issues profilers miss, upgrades severity on real problems | ~$0.01 |
| GoldenFlow | category_llm_correct | Corrects misspelled categories via LLM | ~$0.005 |
| GoldenMatch | llm_scorer=True | Resolves borderline duplicate pairs the fuzzy matcher can't decide on | ~$0.05 |

Every LLM feature is opt-in, provider-agnostic (Anthropic or OpenAI), and degrades gracefully — if the API call fails or budget runs out, you still get your non-LLM results.

Setup

1. Install with LLM extras

```shell
pip install goldencheck[llm] goldenflow[llm] goldenmatch[llm]
```

Or install everything at once with GoldenPipe:

```shell
pip install goldenpipe[all]
```

2. Set your API key

Pick one provider. Both work across all three tools:

```shell
# Option A: Anthropic (default, recommended)
export ANTHROPIC_API_KEY="sk-ant-..."

# Option B: OpenAI
export OPENAI_API_KEY="sk-..."
```

That's it. Every tool auto-detects the provider from your environment.
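The detection order can be sketched in a few lines. This is an illustration of the behavior described above, not the tools' actual code, and it assumes Anthropic wins when both keys are set (per the "default" note in the setup step):

```python
import os

def detect_provider():
    """Pick the LLM provider from environment variables (illustrative sketch,
    not the Golden Suite's real implementation)."""
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"  # assumed default when both keys are present
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    return None

# Simulate an environment with only the OpenAI key set
os.environ.pop("ANTHROPIC_API_KEY", None)
os.environ["OPENAI_API_KEY"] = "sk-test"
print(detect_provider())  # openai
```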

3. Get the data

Download the full UK schools register (52,288 records, 135 columns):

```shell
curl -O https://golden-suite.s3.us-east-2.amazonaws.com/samples/uk_schools.csv
```

Or download directly from the GIAS website (select "Establishment fields").

Step 1: GoldenCheck — Smarter Data Profiling

GoldenCheck scans your CSV for data quality issues: nulls, format inconsistencies, outliers, encoding problems. The standard profilers are fast and free, but they work on statistical patterns. They don't understand what the data means.

scan_file_with_llm() adds a two-stage enhancement pass after the profilers finish:

  1. Semantic type classification — the LLM reads column names and sample values to identify types like email, phone, currency, postcode. This catches columns the profiler mislabeled.
  2. Finding review — the LLM reviews the profiler's findings and can upgrade severity, downgrade false positives, or flag cross-column issues the profilers missed entirely.

Python


```python
import goldencheck

# Standard scan (free, fast)
findings, profile = goldencheck.scan_file("uk_schools.csv")
print(f"Profiler found {len(findings)} issues")

# LLM-boosted scan
findings_llm, profile_llm = goldencheck.scan_file_with_llm("uk_schools.csv", provider="openai")
print(f"LLM-boosted found {len(findings_llm)} issues")
```

Real results on UK schools data

We ran both scans on the full 52,288-row dataset:

| Metric | Standard | LLM-Boosted |
| --- | --- | --- |
| Total findings | 820 | 843 |
| Errors | 0 | 6 |
| Warnings | 160 | 177 |
| Info | 660 | 660 |
| Time | 8.3s | 61.5s |

The LLM added 23 new findings, including 6 errors the profiler completely missed:

  • Invalid locality values with numeric characters in the Locality column
  • Invalid town names with embedded numbers in the Town column
  • Invalid establishment names with numeric characters in EstablishmentName
  • Invalid head teacher names with embedded numbers in HeadFirstName and HeadLastName
  • Invalid provider names in PropsName

The profiler saw these columns as "high cardinality text" and scored everything as info-level. The LLM understood these are proper name fields and flagged numeric contamination as errors.
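Once you know to look for it, this class of error is trivial to check deterministically. A minimal stand-alone sketch (plain Python, not GoldenCheck internals):

```python
import re

def flag_numeric_contamination(column, values):
    """Return values in a proper-name column that contain digits,
    the class of error the LLM pass flagged in EstablishmentName and friends."""
    has_digit = re.compile(r"\d")
    return [v for v in values if v and has_digit.search(v)]

sample = ["Kingsgate Primary School", "Primrose H1ll School", "Torriano Primary School"]
print(flag_numeric_contamination("EstablishmentName", sample))
# ['Primrose H1ll School']
```

The LLM's contribution isn't the regex; it's knowing which of 135 columns are proper-name fields where digits are suspicious.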

Note: with 135 columns, GoldenCheck limits the LLM pass to the 50 columns with the most findings to control cost.

Cost control

Set a budget cap to prevent surprises:

```shell
export GOLDENCHECK_LLM_BUDGET=0.10  # max $0.10 per scan
```

If the estimated cost exceeds your budget, GoldenCheck skips the LLM pass and returns profiler-only results with a warning.

Model defaults

| Provider | Default Model | Input (per 1K tokens) | Output (per 1K tokens) |
| --- | --- | --- | --- |
| Anthropic | claude-haiku-4-5 | $0.0008 | $0.004 |
| OpenAI | gpt-4o-mini | $0.00015 | $0.0006 |
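From those rates, a back-of-envelope estimate shows why the boost pass is cheap. The token counts below are illustrative assumptions, not measured values:

```python
# Rough cost estimate for a GoldenCheck LLM pass at gpt-4o-mini rates
input_rate = 0.00015 / 1000   # $ per input token
output_rate = 0.0006 / 1000   # $ per output token

columns = 50               # capped column count
tokens_in_per_col = 800    # assumed: column summary + sample values
tokens_out_per_col = 200   # assumed: classification + finding review

cost = columns * (tokens_in_per_col * input_rate + tokens_out_per_col * output_rate)
print(f"${cost:.3f}")  # $0.012, in the same ballpark as the ~$0.01 observed
```

Because only column summaries are sent, the estimate doesn't grow with row count.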

Step 2: GoldenFlow — Deterministic + LLM Transforms

GoldenFlow transforms messy data: trimming whitespace, normalizing phone numbers, standardizing case. On the UK schools data, the standard (non-LLM) transforms are impressive on their own:


```python
import polars as pl
import goldenflow

df = pl.read_csv("uk_schools.csv", encoding="utf8-lossy", ignore_errors=True)
result = goldenflow.transform_df(df)
print(f"Transforms applied: {len(result.manifest.records)}")
```

Real results

| Transform | Count |
| --- | --- |
| strip | 111 |
| normalize_unicode | 111 |
| normalize_quotes | 111 |
| collapse_whitespace | 111 |
| null_standardize | 111 |
| category_auto_correct | 96 |
| strip_titles | 6 |
| date_iso8601 | 6 |
| phone_e164 | 2 |
| zip_normalize | 1 |

Total: 666 transforms, 204,283 rows affected in 17s.

For LLM-enhanced categorical correction, you can target specific columns using a config:

```python
from goldenflow.config.schema import GoldenFlowConfig, TransformSpec

config = GoldenFlowConfig(transforms=[
    TransformSpec(column="TypeOfEstablishment (name)", ops=["strip", "category_llm_correct"]),
    TransformSpec(column="PhaseOfEducation (name)", ops=["strip", "category_llm_correct"]),
])
result = goldenflow.transform_df(df, config=config)
```

On this dataset, the LLM corrector found 0 additional fixes — the UK government uses controlled vocabularies, so the categorical columns are already clean. On messier real-world data (CRM exports, manually-entered forms), the LLM corrector catches misspellings and abbreviations that deterministic transforms miss.
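For intuition, a deterministic approximation of categorical correction can be built on stdlib fuzzy matching. The LLM goes further on abbreviations and domain synonyms, but the shape is the same; the vocabulary below is a small illustrative sample, not the full GIAS list:

```python
from difflib import get_close_matches

VALID = ["Academy converter", "Community school", "Voluntary aided school"]

def correct_category(value, valid=VALID, cutoff=0.8):
    """Map a possibly-misspelled category to the closest controlled-vocabulary term."""
    match = get_close_matches(value, valid, n=1, cutoff=cutoff)
    return match[0] if match else value  # leave unmatched values untouched

print(correct_category("Comunity school"))    # Community school
print(correct_category("Academy convertor"))  # Academy converter
```

The cutoff keeps genuinely novel values alone instead of force-fitting them to the vocabulary.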

Step 3: GoldenMatch — LLM Scoring for Borderline Pairs

This is where LLMs have the biggest impact. GoldenMatch uses blocking, fuzzy string matching, and weighted scoring to find duplicate records. Most pairs are clear — either an obvious match (score > 0.95) or an obvious non-match (score < 0.75). But the pairs in between need judgment.

LLM scoring sends only the borderline pairs to the LLM. This keeps costs low while improving accuracy on the hard cases.

How it works

```text
Score > 0.95    → Auto-accept (no LLM call)
Score 0.75-0.95 → Send to LLM for match/no-match decision
Score < 0.75    → Auto-reject (no LLM call)
```
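The banding above reduces to a three-way router; a minimal sketch (threshold values mirror the config shown below, the LLM call itself is elided):

```python
def route_pair(score, lo=0.75, hi=0.95):
    """Route a candidate pair by fuzzy score: only the middle band costs an LLM call."""
    if score > hi:
        return "auto_accept"
    if score < lo:
        return "auto_reject"
    return "llm_review"

print(route_pair(0.97))  # auto_accept
print(route_pair(0.85))  # llm_review
print(route_pair(0.60))  # auto_reject
```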

Python — simple


```python
import polars as pl
import goldenmatch

df = pl.read_csv("uk_schools.csv", encoding="utf8-lossy", ignore_errors=True)
df = df.cast({col: pl.Utf8 for col in df.columns})  # prevent schema mismatch

result = goldenmatch.dedupe_df(
    df,
    exact=["Postcode"],
    fuzzy={"EstablishmentName": 0.8, "Town": 0.7},
    llm_scorer=True,
)
print(f"Clusters: {result.total_clusters}")
```

Python — full config with LLM scorer

```python
from goldenmatch.config.schemas import (
    GoldenMatchConfig, LLMScorerConfig, BudgetConfig,
    MatchkeyConfig, MatchkeyField,
    BlockingConfig, BlockingKeyConfig,
)

config = GoldenMatchConfig(
    blocking=BlockingConfig(
        keys=[BlockingKeyConfig(fields=["Postcode"], transforms=["strip"])],
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(fields=["Postcode"], transforms=["strip"]),
            BlockingKeyConfig(fields=["EstablishmentName"], transforms=["soundex"]),
        ],
    ),
    matchkeys=[MatchkeyConfig(
        name="school_identity",
        type="weighted",
        threshold=0.70,
        fields=[
            MatchkeyField(field="EstablishmentName", scorer="ensemble", weight=1.5, transforms=["lowercase", "strip"]),
            MatchkeyField(field="Postcode", scorer="jaro_winkler", weight=1.0, transforms=["strip"]),
            MatchkeyField(field="Town", scorer="jaro_winkler", weight=0.5, transforms=["lowercase", "strip"]),
        ],
    )],
    llm_scorer=LLMScorerConfig(
        enabled=True,
        mode="pairwise",
        provider="openai",
        auto_threshold=0.95,
        candidate_lo=0.75,
        candidate_hi=0.95,
        batch_size=20,
        budget=BudgetConfig(
            max_cost_usd=0.50,
            max_calls=200,
            warn_at_pct=80,
            escalation_model="gpt-4o",
            escalation_band=[0.80, 0.90],
            escalation_budget_pct=20,
        ),
    ),
)

result = goldenmatch.dedupe_df(df, config=config)
```

Real results on UK schools data

We ran two configurations on the full 52,288-row dataset:

| Config | Clusters | Records in Clusters | Time |
| --- | --- | --- | --- |
| Simple (exact postcode + fuzzy name) | 13,788 | 34,021 | 158s |
| Weighted + LLM scorer | 3,475 | 47,859 | 640s |

The simple config blocks on exact postcode matches and finds 13,788 clusters — but includes false positives where different schools share a postcode. The weighted config with multi-pass blocking (postcode + soundex) casts a wider net and uses the LLM scorer to adjudicate borderline pairs.
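Multi-pass blocking is the key idea here: any pair of records that shares a key in any pass becomes a candidate. A toy sketch (first-token name keys stand in for soundex; this is not GoldenMatch's actual blocker):

```python
from collections import defaultdict
from itertools import combinations

def block(records, key_fns):
    """Union of candidate pairs produced by several blocking passes."""
    pairs = set()
    for key_fn in key_fns:
        buckets = defaultdict(list)
        for i, rec in enumerate(records):
            buckets[key_fn(rec)].append(i)
        for ids in buckets.values():
            pairs.update(combinations(ids, 2))  # every pair within a bucket
    return pairs

records = [
    {"name": "Kingsgate Junior School", "postcode": "NW6 4LB"},
    {"name": "Kingsgate Primary School", "postcode": "NW6 4LB"},
    {"name": "The Hall School", "postcode": "NW3 4NU"},
]
pairs = block(records, [lambda r: r["postcode"].strip(),
                        lambda r: r["name"].split()[0].lower()])
print(pairs)  # {(0, 1)}, the two Kingsgate records meet in both passes
```

Adding passes only ever widens the candidate set; the scorers then do the narrowing.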

Example duplicate clusters found

These are real matches from the UK schools register:

| Cluster | Records | Pattern |
| --- | --- | --- |
| Kingsgate Junior School (Closed) ↔ Kingsgate Primary School (Open) | Same postcode NW6 4LB | Academisation |
| Primrose Hill Infant School (Closed) ↔ Primrose Hill School (Open) | Same postcode NW1 8JL | School merger |
| Torriano Infant School (Closed) ↔ Torriano Primary School (Open) | Same postcode NW5 2SJ | Phase consolidation |
| Edith Moorhouse Primary School (Closed) ↔ Edith Moorhouse Primary School (Open) | Same postcode OX18 3HP | Reopened under new trust |

Schools that academised appear twice — once as the closed LA school and once as the new academy. The LLM scorer understands that "St Mary's Church of England Primary School" and "St Mary's CE Academy" at the same postcode are the same institution, even when the fuzzy score is borderline.
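One way to see why such pairs score low for a pure fuzzy matcher, and how normalization helps, is to expand common abbreviations before comparing. The expansion table below is illustrative, not something GoldenMatch ships:

```python
import re

# Illustrative expansions for UK school-name abbreviations
EXPANSIONS = {
    r"\bCE\b": "Church of England",
    r"\bRC\b": "Roman Catholic",
    r"\bSt\b": "Saint",
}

def normalize_name(name):
    """Expand abbreviations, then lowercase, so variants share a long prefix."""
    for pat, full in EXPANSIONS.items():
        name = re.sub(pat, full, name)
    return name.lower()

a = normalize_name("St Mary's Church of England Primary School")
b = normalize_name("St Mary's CE Academy")
print(a)  # saint mary's church of england primary school
print(b)  # saint mary's church of england academy
```

After expansion, the two names agree on everything but the final word, which is exactly the kind of signal an LLM picks up without needing the lookup table at all.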

Cluster size distribution

```text
Size 1:  18,267 singletons (unique schools)
Size 2:  10,421 pairs (most common — closed/open pairs)
Size 3:   2,188 triples
Size 4:     941 quads
Size 5+:    240 larger clusters
```

Important caveat: this dataset has no ground truth labels for duplicates. The cluster counts are what GoldenMatch reports, not verified precision. Schools sharing a postcode aren't necessarily duplicates — use the review queue (goldenmatch review) to verify a sample before trusting the numbers in production.

Budget tracking

GoldenMatch tracks every API call and stops when the budget runs out:

```text
INFO  LLM scorer: auto-accept (>0.95), candidates (0.75-0.95), below threshold
INFO  LLM cost: $0.03 (budget remaining: 94%)
```

The budget has two tiers:

  1. Standard model (gpt-4o-mini or claude-haiku) — handles most borderline pairs cheaply
  2. Escalation model (gpt-4o or claude-sonnet) — reserved for the hardest pairs in the 0.80-0.90 range
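The two-tier policy can be sketched as a model picker (model names as in the pricing table above; the escalation-budget bookkeeping is omitted, and this is a sketch of the described behavior, not the library's internals):

```python
def pick_model(score, band=(0.80, 0.90), escalation_left=True):
    """Send the hardest band to the stronger model while escalation budget remains."""
    lo, hi = band
    if escalation_left and lo <= score <= hi:
        return "gpt-4o"       # escalation model for the hardest pairs
    return "gpt-4o-mini"      # standard model handles everything else

print(pick_model(0.85))                         # gpt-4o
print(pick_model(0.92))                         # gpt-4o-mini
print(pick_model(0.85, escalation_left=False))  # gpt-4o-mini
```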

Step 4: GoldenPipe — Full Pipeline

GoldenPipe orchestrates all three tools. On the UK schools data:


```python
import goldenpipe

gp = goldenpipe.Pipeline()
result = gp.run("uk_schools.csv")

print(f"Status: {result.status}")
print(f"Input rows: {result.input_rows}")
for name, stage in result.stages.items():
    print(f"  {name}: {stage.status}")
```

Real pipeline timing

| Stage | Time | Status |
| --- | --- | --- |
| Load | 0.0s | Success |
| GoldenCheck scan | 8.3s | Success |
| GoldenFlow transform | 18.4s | Success |
| GoldenMatch dedupe | 30.9s | Success |
| Total | 58s | |

The pipeline's adaptive logic decides which stages to run. If GoldenCheck finds no quality issues, GoldenFlow skips unnecessary transforms. The pipeline keeps data flowing as Polars DataFrames between stages — no disk I/O between tools.

Tip: Cast all columns to string before dedup if your data has mixed types (e.g., birth_year as both integer and string). GoldenMatch requires consistent schemas across pipeline stages.

```python
df = df.cast({col: pl.Utf8 for col in df.columns})
```

Real Cost Breakdown

Here's what LLM features actually cost on the full 52,288-row UK schools dataset using OpenAI (gpt-4o-mini):

| Stage | Cost | Notes |
| --- | --- | --- |
| GoldenCheck LLM boost | ~$0.01 | 50 columns analyzed (135 total, capped at 50) |
| GoldenFlow LLM correct | ~$0.005 | Per categorical column targeted |
| GoldenMatch LLM scorer | ~$0.03 | Only borderline pairs sent |
| Total | ~$0.05 | For 52K rows |

GoldenCheck and GoldenFlow costs are nearly flat regardless of row count — they send column summaries, not individual rows. GoldenMatch scales with the number of borderline pairs.

OpenAI is ~5x cheaper than Anthropic for these tasks (gpt-4o-mini vs claude-haiku). Both produce comparable results.

Key Takeaways

  • LLM features are opt-in — call scan_file_with_llm() or pass llm_scorer=True. Set one API key. Everything else auto-detects.
  • GoldenCheck LLM caught 23 findings the profiler missed on 52K rows — including 6 errors (invalid names with numeric characters) that would have gone unnoticed.
  • GoldenFlow's standard transforms handled 204,283 cell corrections without any LLM cost. The LLM corrector adds value on messy categorical data (CRM exports, forms), less so on government controlled vocabularies.
  • GoldenMatch's LLM scorer is the game-changer — it resolves borderline pairs like "Kingsgate Junior School" vs "Kingsgate Primary School" that need human-like judgment.
  • Total cost: ~$0.05 for 52K rows with OpenAI. Budget caps prevent runaway spending. Graceful degradation means nothing breaks if the API fails.

Try It

Download the UK schools data and run it yourself:

```shell
pip install goldenpipe[all]
export OPENAI_API_KEY="sk-..."
curl -O https://golden-suite.s3.us-east-2.amazonaws.com/samples/uk_schools.csv
goldenpipe run uk_schools.csv
```

Or try the non-LLM version in the Playground to see what the pipeline does before adding LLM boost.

Explore the source on GitHub: GoldenCheck | GoldenFlow | GoldenMatch | GoldenPipe


