
Daniel Rozin

Posted on • Originally published at aversusb.net

Entity Resolution at Scale: Matching Products Across Amazon, Reddit, and RTINGS

"AirPods Pro 2," "Apple AirPods Pro (2nd Generation)," "AirPods Pro USB-C" — same product, three different names.

Entity resolution — figuring out that different strings refer to the same real-world thing — is one of the hardest problems in product data engineering. At SmartReview, we match products across 50+ review sources, each with its own naming conventions, categorization, and data formats.

Here's how we solved it without spending six months building a custom ML model.

The Problem Space

Consider matching products across these sources:

| Source | Product Name | Format |
|---|---|---|
| Amazon | Apple AirPods Pro (2nd Generation) - MagSafe Case (USB-C) | Full name + SKU details |
| Reddit | AirPods Pro 2 | Colloquial shorthand |
| RTINGS | Apple AirPods Pro 2nd Gen | Abbreviated formal |
| YouTube | NEW AirPods Pro 2 USB-C Review | Title with marketing fluff |
| Best Buy | Apple - AirPods Pro 2 - White | Brand-prefixed with color |

All five refer to the same product. A naive string match would treat them as five different products.

Our Three-Layer Approach

We use three complementary techniques, each catching matches the others miss.

Layer 1: Brand + Model Normalization

The first pass normalizes brand names and extracts model identifiers:

```typescript
interface NormalizedProduct {
  brand: string;           // "apple"
  modelFamily: string;     // "airpods pro"
  generation: string;      // "2nd gen"
  variant: string;         // "usb-c"
  rawName: string;         // original string
}

// extractBrand, extractPattern, extractVariant, and extractModelFamily
// are lookup helpers defined elsewhere in the pipeline.
function normalizeProductName(raw: string): NormalizedProduct {
  let name = raw.toLowerCase().trim();

  // Remove common noise
  name = name
    .replace(/\b(new|latest|best|review|vs\.?)\b/g, "")
    .replace(/[-–—]/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  // Extract brand (lookup against known brand list)
  const brand = extractBrand(name);

  // Extract generation markers
  const genPatterns = [
    /(?:gen(?:eration)?\s*)(\d+)/i,
    /(\d+)(?:st|nd|rd|th)\s*gen/i,
    /\b(\d+)\b(?=\s|$)/,  // trailing number often = generation
  ];
  const generation = extractPattern(name, genPatterns);

  // Extract variant (color, connectivity, size)
  const variant = extractVariant(name);

  // What remains is the model family
  const modelFamily = extractModelFamily(name, brand, generation, variant);

  return { brand, modelFamily, generation, variant, rawName: raw };
}
```

This handles ~60% of matches — the straightforward cases where brand, model, and generation all align.

Layer 2: Fuzzy String Matching

For the remaining 40%, we use Levenshtein distance with a category-aware threshold:

```typescript
function fuzzyMatch(
  a: NormalizedProduct,
  b: NormalizedProduct,
  threshold: number = 0.85
): boolean {
  // Brand must match exactly (after normalization)
  if (a.brand !== b.brand) return false;

  // Compare model family with fuzzy matching
  const similarity = stringSimilarity(
    a.modelFamily,
    b.modelFamily
  );

  if (similarity < threshold) return false;

  // If generations are both present, they must match
  if (a.generation && b.generation && a.generation !== b.generation) {
    return false;
  }

  return true;
}
```

The key insight: the brand must match exactly, while the model family may be fuzzy. The threshold still needs care, though: "Sony WH-1000XM5" and "Sony WF-1000XM5" (over-ear vs in-ear, completely different products) differ by a single character, so near-identical model codes are exactly where a loose threshold produces false positives.

Layer 3: Cross-Reference Validation

For edge cases, we validate matches against external canonical sources:

```typescript
async function crossReferenceValidate(
  candidates: NormalizedProduct[]
): Promise<ProductCluster[]> {
  const clusters: ProductCluster[] = [];

  for (const candidate of candidates) {
    // Search for the product on a canonical source
    const canonicalResults = await tavily.search(
      `${candidate.brand} ${candidate.modelFamily} specifications`,
      { searchDepth: "basic", maxResults: 3 }
    );

    // Extract canonical product identifier
    const canonicalId = extractCanonicalId(canonicalResults);

    // Group by canonical ID
    const existing = clusters.find(c => c.canonicalId === canonicalId);
    if (existing) {
      existing.members.push(candidate);
    } else {
      clusters.push({
        canonicalId,
        canonicalName: candidate.rawName,
        members: [candidate],
      });
    }
  }

  return clusters;
}
```

Handling the Hard Cases

Product Lines vs Individual Products

"Roomba" could mean the brand, the product line, or a specific model (Roomba j7+). We use context clues:

  • If a review discusses specific features ("self-emptying base"), it's likely a specific model
  • If it's a general comparison ("Roomba vs Roborock"), it's the product line
  • We maintain a hierarchy: Brand → Line → Model → Variant

Regional Name Differences

The same product sometimes has different names in different markets. The Samsung Galaxy S24 is called "Galaxy S24" everywhere, but some accessories have region-specific names. We maintain an alias table for known cases.

Discontinued vs Current Models

When someone searches "AirPods Pro vs Sony," do they mean the current or previous generation? We default to current unless the query specifies otherwise, but we keep both generations in our database with clear generation markers.

Performance at Scale

Our entity resolution pipeline processes ~5,000 product mentions daily across all review sources:

| Metric | Value |
|---|---|
| Products in canonical database | 12,000+ |
| Daily new mentions processed | ~5,000 |
| Match accuracy (spot-checked) | 94.2% |
| False positive rate | 1.8% |
| Processing time (full pipeline) | ~12 minutes |
| Most common false positive | Generation confusion (XM4 vs XM5) |

The 1.8% false positive rate is acceptable because our trust score system (covered in Part 5) catches anomalies — if a "product" suddenly has wildly inconsistent ratings, it's likely a merge error.

Lessons Learned

  1. Don't build ML first. Our three-layer heuristic approach handles 94%+ of cases. ML would marginally improve accuracy but massively increase complexity.
  2. Brand matching must be exact. Fuzzy brand matching creates catastrophic false positives.
  3. Generation numbers are treacherous. "AirPods 3" and "AirPods Pro 3" are different products. Always match model family before generation.
  4. Maintain a manual override table. Some matches are just weird. "Galaxy Buds2 Pro" vs "Galaxy Buds 2 Pro" (note the space) — keep a list of known aliases.
  5. Log everything. When a match seems wrong in production, you need the matching pipeline's decision trail to diagnose why.

What's Next

We're exploring embedding-based matching for the long tail — products where our heuristics fail because names are too dissimilar. Early experiments with product description embeddings show promise for matching across languages.

See entity resolution in action on aversusb.net — every comparison page unifies data from multiple sources under a single canonical product identity.


Part 8 of our "Building SmartReview" series. Previous: Part 7: People Also Ask Content Discovery
