benzsevern

Posted on • Originally published at bensevern.dev

Deduplicating 401,000 Equipment Auction Records with LLM Calibration

Equipment auction data is messy. The same Caterpillar D6 bulldozer sells three times across different states, and each auction house records the model slightly differently. "580 Super L" in one record becomes "580SL" in another. "D6H LGP" appears alongside "D6H" — same machine, different level of detail.

We took the Kaggle Bulldozer Blue Book dataset — 401,125 auction records across 53 columns — and threw it at GoldenMatch to see what happens.

The result: 27,937 duplicate clusters, 384,650 records matched, and the LLM learned the perfect threshold from just 200 pairs. Total LLM cost: $0.01.

The Dataset

The Bulldozer dataset comes from a Kaggle competition on predicting auction sale prices. Each row is a single auction transaction:

  • 401,125 rows, 53 columns
  • Key fields: fiModelDesc (model description), fiBaseModel (base model code), YearMade, state, SalePrice, ProductGroup, fiProductClassDesc
  • The same machine appears multiple times across different auctions, states, and years
  • Model descriptions vary: "580 Super L", "580SL", "580super l" are the same machine
  • 90+ oversized blocking groups (e.g., "D000" with 61,861 records sharing the same soundex code)
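A quick standard-library illustration of why plain string equality fails on these variants (the `normalize` helper is hypothetical, not part of GoldenMatch):

```python
import re
from difflib import SequenceMatcher

def normalize(desc: str) -> str:
    """Lowercase and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", desc.lower())

# Casing/spacing variants collapse to the same key...
assert normalize("580 Super L") == normalize("580super l")   # "580superl"

# ...but abbreviations do not, so fuzzy similarity is still needed:
a, b = normalize("580 Super L"), normalize("580SL")
ratio = SequenceMatcher(None, a, b).ratio()   # about 0.71: close, not equal
```

Normalization alone catches the easy cases; the abbreviated forms are what the fuzzy scorers and LLM calibration below have to handle.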

The Challenge: Auto-Config Gets It Wrong

Our first attempt used GoldenMatch's zero-config mode:

import goldenmatch

result = goldenmatch.dedupe_df(df, llm_scorer=True)

Zero matches. The auto-config misclassified the columns:

| Column | Expected Type | Auto-Config Said |
| --- | --- | --- |
| SalesID | Identifier | "phone" (7-digit integers look like phone numbers) |
| SalePrice | Numeric/Price | "zip" (5-digit round numbers look like zip codes) |
| fiModelDesc | Text (most important!) | Dropped (fuzzy field limit of 5, picked by column order) |

The ID pattern was checked last in the classification chain, so data profiling overrode the name heuristic. And fiModelDesc — the single most discriminating field — was arbitrarily dropped because it appeared after five other columns.

We fixed all three issues in v1.2.6:

  1. ID patterns checked first — SalesID now correctly classified as "identifier"
  2. Price/cost/amount patterns added — SalePrice stays "numeric" instead of becoming "zip"
  3. Utility-based field ranking — fields ranked by cardinality × completeness × string_length instead of column order. fiModelDesc (high cardinality, long strings) naturally ranks above UsageBand (3 values, 4 chars)
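Both fixes are easy to sketch. The snippet below is a simplified stand-in for illustration, not GoldenMatch's actual classifier; the name patterns and digit-range thresholds are assumptions:

```python
def classify(name, sample):
    """Classify a column: name heuristics first, data profiling as fallback."""
    n = name.lower()
    if n.endswith("id") or n.endswith("key"):
        return "identifier"      # SalesID stops here, never reaches "phone"
    if any(p in n for p in ("price", "cost", "amount")):
        return "numeric"         # SalePrice stops here, never reaches "zip"
    if all(isinstance(v, int) and 1_000_000 <= v <= 9_999_999 for v in sample):
        return "phone"           # 7-digit integers, checked only as fallback
    if all(isinstance(v, int) and 10_000 <= v <= 99_999 for v in sample):
        return "zip"             # 5-digit integers, checked only as fallback
    return "text"

def utility(values):
    """Rank a field by cardinality x completeness x average string length."""
    filled = [v for v in values if v not in (None, "")]
    if not filled:
        return 0.0
    completeness = len(filled) / len(values)
    cardinality = len(set(filled)) / len(filled)
    avg_len = sum(len(str(v)) for v in filled) / len(filled)
    return cardinality * completeness * avg_len

assert classify("SalesID", [1139246, 1139248]) == "identifier"
assert classify("SalePrice", [66000, 57000]) == "numeric"
# High-cardinality long strings outrank a 3-value categorical:
assert utility(["580 Super L", "D6H LGP", "950FII", "416D"]) > \
       utility(["Low", "High", "Low", "Medium"])
```

With this ordering, data profiling can no longer override a confident name match, and field selection no longer depends on column order.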

The Configuration

Here's the optimal config we landed on:


import goldenmatch
from goldenmatch.config.schemas import (
    GoldenMatchConfig, MatchkeyConfig, MatchkeyField,
    BlockingConfig, BlockingKeyConfig, LLMScorerConfig, BudgetConfig,
)

config = GoldenMatchConfig(
    # Multi-pass blocking with ANN fallback
    blocking=BlockingConfig(
        strategy="multi_pass",
        passes=[
            BlockingKeyConfig(
                fields=["fiModelDesc", "state"],
                transforms=["lowercase", "strip"],
            ),
            BlockingKeyConfig(
                fields=["fiModelDesc", "ProductGroup"],
                transforms=["lowercase", "strip"],
            ),
            BlockingKeyConfig(
                fields=["fiBaseModel"],
                transforms=["lowercase", "soundex"],
            ),
        ],
        max_block_size=1000,
        skip_oversized=True,
        ann_column="__equipment_text__",  # ANN fallback
        ann_top_k=20,
    ),

    # Weighted fuzzy scoring
    matchkeys=[MatchkeyConfig(
        name="equipment_match",
        type="weighted",
        threshold=0.90,
        fields=[
            MatchkeyField(field="fiModelDesc", scorer="ensemble",
                          weight=2.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="fiBaseModel", scorer="jaro_winkler",
                          weight=1.5, transforms=["lowercase", "strip"]),
            MatchkeyField(field="fiProductClassDesc", scorer="token_sort",
                          weight=1.0, transforms=["lowercase", "strip"]),
            MatchkeyField(field="YearMade", scorer="exact", weight=0.8),
            MatchkeyField(field="state", scorer="exact", weight=0.3),
            MatchkeyField(field="ProductGroup", scorer="exact", weight=0.5),
        ],
    )],

    # LLM calibration
    llm_scorer=LLMScorerConfig(
        enabled=True,
        provider="openai",
        auto_threshold=0.95,
        candidate_lo=0.85,
        candidate_hi=0.95,
        batch_size=75,
        max_workers=3,
        budget=BudgetConfig(max_cost_usd=1.00, max_calls=500),
    ),
)

result = goldenmatch.dedupe_df(df, config=config)

Three blocking passes catch different types of duplicates. The ensemble scorer (max of jaro_winkler, token_sort, and soundex) handles model description variants. Exact scorers on year, state, and product group anchor the match.
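For intuition, the weighted combination is just a weight-normalized sum of per-field scores. A minimal sketch with hypothetical field scores (the numbers are made up, not real scorer output):

```python
def weighted_score(field_scores, weights):
    """Weight-normalized sum of per-field similarity scores (0-1)."""
    total = sum(weights[f] * field_scores[f] for f in weights)
    return total / sum(weights.values())

weights = {"fiModelDesc": 2.0, "fiBaseModel": 1.5, "fiProductClassDesc": 1.0,
           "YearMade": 0.8, "state": 0.3, "ProductGroup": 0.5}

# Hypothetical pair: near-identical model text, exact everywhere else.
scores = {"fiModelDesc": 0.95, "fiBaseModel": 1.0, "fiProductClassDesc": 1.0,
          "YearMade": 1.0, "state": 1.0, "ProductGroup": 1.0}

s = weighted_score(scores, weights)   # (1.9 + 4.1) / 6.1, about 0.984
```

A near-perfect model-description score with exact matches elsewhere clears the 0.90 threshold comfortably; a mediocre model score drags the total down fast because of its 2.0 weight.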

ANN Hybrid Blocking: Don't Skip Oversized Blocks

The multi-pass blocking produced 37,492 blocks, but 90+ were oversized (>1,000 records) and would normally be skipped. The soundex pass creates blocks like "D000" with 61,861 records — every Caterpillar D-series machine in the dataset.

Skipping these blocks means missing legitimate duplicates. But processing them is too expensive (61K records = 1.9 billion pair comparisons).
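The 1.9 billion figure is the all-pairs formula n*(n-1)/2 applied to the block:

```python
n = 61_861                      # records in the "D000" soundex block
pairs = n * (n - 1) // 2        # pairwise comparisons needed to process it
assert pairs == 1_913_360_730   # roughly 1.9 billion
```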

ANN hybrid blocking solves this. When a block is oversized, GoldenMatch:

  1. Extracts unique text values from the block (61,861 records might have only 187 unique model descriptions)
  2. Embeds the unique texts via Vertex AI (text-embedding-004)
  3. Uses FAISS to find nearest neighbors among the unique texts
  4. Groups neighbors into sub-blocks via Union-Find
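Steps 3 and 4 — grouping nearest neighbors into sub-blocks — boil down to Union-Find over neighbor pairs. A sketch with the embedding and FAISS lookup stubbed out (the neighbor pairs below are hypothetical ANN output, not real results):

```python
def find(parent, x):
    """Find the root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def sub_blocks(texts, neighbor_pairs):
    """Group unique texts into sub-blocks from ANN neighbor pairs."""
    parent = {t: t for t in texts}
    for a, b in neighbor_pairs:          # e.g. from a FAISS top-k search
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for t in texts:
        groups.setdefault(find(parent, t), []).append(t)
    return list(groups.values())

texts = ["D6H", "D6H LGP", "D6H XL", "D8N", "D8N DSL"]
# Hypothetical ANN output: pairs whose embeddings are nearest neighbors.
pairs = [("D6H", "D6H LGP"), ("D6H LGP", "D6H XL"), ("D8N", "D8N DSL")]
blocks = sub_blocks(texts, pairs)
# Two sub-blocks: the D6H variants together, the D8N variants together.
```

Each resulting group becomes a sub-block small enough for ordinary pairwise scoring.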

The result: 15 oversized blocks were recovered into 363 sub-blocks, matching 949 additional records that string blocking missed entirely. Embedding 187 unique texts took 4 seconds.

Blocks that are truly massive (>10x max_block_size, like "D000" at 61K) are still skipped — the soundex is too coarse to be useful.

Iterative LLM Calibration: 200 Pairs Instead of 37,500

After fuzzy scoring, GoldenMatch found 13.2 million pairs above 0.85. Of these, 13.1 million scored above 0.95 (auto-accepted) and 135,530 fell in the borderline range (0.85-0.95).

The old approach would score all 135,530 borderline pairs via the LLM — that's 1,807 API calls at batch_size=75, roughly $0.50 and 25 minutes. On our first attempt, it actually produced 23.7 million candidates and hit OpenAI's rate limit after 80 batches.

Iterative calibration takes a different approach:

Round 1: Sample 100 pairs spread across the 0.85-0.95 score range. Send to GPT-4o-mini. Result: 2 matches, 98 non-matches. Grid search finds threshold = 0.946.

Round 2: Sample 100 pairs near the threshold (0.916-0.948). Result: 11 matches, 89 non-matches. Updated threshold = 0.947. Delta = 0.001 (below convergence threshold of 0.01).

Converged. Apply threshold 0.947 to all 135,530 candidates: 1,533 promoted to match, 133,810 kept at their fuzzy score.

Total: 200 pairs scored, 2 API calls per round, $0.01, 42 seconds.

The key insight: we don't need to ask the LLM about every borderline pair. We just need to learn where the boundary is. A stratified sample of 100 pairs across the score range is enough to find it, and a focused sample near the boundary confirms it.
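A condensed sketch of that loop, with a synthetic oracle standing in for the GPT-4o-mini batch call (function names and sampling details are illustrative, not GoldenMatch's internals):

```python
import random

def best_threshold(labeled):
    """Grid-search the cutoff that best separates match/non-match labels."""
    candidates = [0.85 + i * 0.001 for i in range(101)]   # 0.850 .. 0.950
    def accuracy(t):
        return sum((score >= t) == is_match for score, is_match in labeled)
    return max(candidates, key=accuracy)

def calibrate(scores, llm_label, rounds=5, sample=100, tol=0.01, seed=0):
    rng = random.Random(seed)
    threshold, labeled = None, []
    pool = sorted(scores)
    for _ in range(rounds):
        if threshold is None:
            # Round 1: stratified sample across the whole score range.
            batch = pool[:: max(1, len(pool) // sample)][:sample]
        else:
            # Later rounds: focused sample near the current boundary.
            near = [s for s in pool if abs(s - threshold) < 0.02]
            batch = rng.sample(near, min(sample, len(near)))
        labeled += [(s, llm_label(s)) for s in batch]
        new = best_threshold(labeled)
        if threshold is not None and abs(new - threshold) < tol:
            return new                                    # converged
        threshold = new
    return threshold

# Synthetic borderline scores with a planted boundary at 0.93:
scores = [0.85 + 0.1 * i / 4999 for i in range(5000)]
t = calibrate(scores, llm_label=lambda s: s >= 0.93)
# t lands close to the planted boundary after a couple of rounds
```

On this synthetic distribution, two rounds of 100 labels are enough to recover the boundary; the real pipeline does the same thing with LLM verdicts instead of an oracle.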

Results

| Metric | Value |
| --- | --- |
| Total time | 323 seconds |
| Records matched | 384,650 (95.9%) |
| Unique (singletons) | 16,475 (4.1%) |
| Duplicate clusters | 27,937 |
| Golden records | 27,426 |
| LLM pairs scored | 200 |
| LLM cost | ~$0.01 |
| ANN sub-blocks | 363 (from 15 oversized blocks) |
| Cluster confidence >= 0.4 | 87.7% |

Example Clusters

Cluster 2 (size 3):
  950FII | 950 | 1996 | North Carolina | $57,000
  950FII | 950 | 1996 | North Carolina | $60,000
  950FII | 950 | 1996 | North Carolina | $91,000

Cluster 8 (size 3):
  416D | 416 | 2001 | Illinois | $27,000
  416D | 416 | 2001 | Illinois | $27,000
  416D | 416 | 2001 | Illinois | $26,500

Cluster 33 (size 4):
  750BLT | 750 | 1000 | Michigan | $14,000
  750BLT | 750 | 1000 | California | $18,000
  750BLT | 750 | 1000 | Alabama | $21,500
  750BLT | 750 | 1000 | Arkansas | $13,000

Same equipment model, same year, same or different states, different sale prices. These are the same machine being resold at different auctions. (In cluster 33, YearMade = 1000 is the dataset's placeholder for an unknown manufacturing year.)

Accuracy: Optimal vs Previous

We compared the v1.2.6 optimal config against a simpler string-only config:

| Metric | String Only | Optimal (ANN + LLM) |
| --- | --- | --- |
| Clusters | 28,703 | 27,937 |
| Records matched | 383,701 | 384,650 (+949) |
| Confidence >= 0.4 | 77.1% | 87.7% |
| LLM threshold | 0.909 (too permissive) | 0.947 (precise) |
| LLM pairs promoted | 8,647,816 | 1,533 |

The string-only run's LLM threshold was too low (0.909), promoting 8.6 million pairs — many of which were different equipment variants sharing a model number (e.g., "550BLT" vs "550JLGP"). The optimal run's tighter threshold (0.947) correctly kept these as separate records.

The 34 records that were "lost" (matched in the string-only run, singletons in optimal) were verified as false positives — different equipment variants with generic model numbers like "530", "550", "325" that have thousands of records across 10+ variants.

Try It Yourself

pip install goldenmatch==1.2.6

The full example is at examples/equipment_dedup.py in the GoldenMatch repo. Download the Bulldozer dataset from Kaggle and run:

export OPENAI_API_KEY=sk-...
export DATA_PATH=Train.csv
python examples/equipment_dedup.py

ANN hybrid blocking is optional — without Vertex AI credentials, GoldenMatch simply skips oversized blocks as before. The LLM calibration alone is the biggest accuracy improvement.

What's Next

The iterative calibration approach opens up a path to persistent learning: save the LLM's threshold decisions across runs, so subsequent deduplications start with a pre-calibrated threshold instead of learning from scratch. We're building this as part of GoldenMatch's Learning Memory system.

For now, v1.2.6 is live on PyPI and the full configuration is documented in the LLM Integration guide and Blocking Strategies guide.


