"AirPods Pro 2," "Apple AirPods Pro (2nd Generation)," "AirPods Pro USB-C" — same product, three different names.
Entity resolution — figuring out that different strings refer to the same real-world thing — is one of the hardest problems in product data engineering. At SmartReview, we match products across 50+ review sources, each with its own naming conventions, categorization, and data formats.
Here's how we solved it without spending six months building a custom ML model.
The Problem Space
Consider matching products across these sources:
| Source | Product Name | Format |
|---|---|---|
| Amazon | Apple AirPods Pro (2nd Generation) - MagSafe Case (USB-C) | Full name + SKU details |
| — | AirPods Pro 2 | Colloquial shorthand |
| RTINGS | Apple AirPods Pro 2nd Gen | Abbreviated formal |
| YouTube | NEW AirPods Pro 2 USB-C Review | Title with marketing fluff |
| Best Buy | Apple - AirPods Pro 2 - White | Brand-prefixed with color |
All five refer to the same product. A naive string match would treat them as five different products.
Our Three-Layer Approach
We use three complementary techniques, each catching matches the others miss.
Layer 1: Brand + Model Normalization
The first pass normalizes brand names and extracts model identifiers:
```typescript
interface NormalizedProduct {
  brand: string;        // "apple"
  modelFamily: string;  // "airpods pro"
  generation: string;   // "2nd gen"
  variant: string;      // "usb-c"
  rawName: string;      // original string
}

function normalizeProductName(raw: string): NormalizedProduct {
  let name = raw.toLowerCase().trim();

  // Remove common noise
  name = name
    .replace(/\b(new|latest|best|review|vs\.?)\b/g, "")
    .replace(/[-–—]/g, " ")
    .replace(/\s+/g, " ")
    .trim();

  // Extract brand (lookup against known brand list)
  const brand = extractBrand(name);

  // Extract generation markers
  const genPatterns = [
    /(?:gen(?:eration)?\s*)(\d+)/i,
    /(\d+)(?:st|nd|rd|th)\s*gen/i,
    /\b(\d+)\b(?=\s|$)/, // trailing number often = generation
  ];
  const generation = extractPattern(name, genPatterns);

  // Extract variant (color, connectivity, size)
  const variant = extractVariant(name);

  // What remains is the model family
  const modelFamily = extractModelFamily(name, brand, generation, variant);

  return { brand, modelFamily, generation, variant, rawName: raw };
}
```
This handles ~60% of matches — the straightforward cases where brand, model, and generation all align.
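`normalizeProductName` leans on helpers the post doesn't show. A minimal sketch of `extractBrand` and `extractPattern` might look like the following (the `KNOWN_BRANDS` list is illustrative, not the production lookup):

```typescript
// Hypothetical helpers assumed by normalizeProductName; the real
// implementations are not shown in the post.
const KNOWN_BRANDS = ["apple", "sony", "samsung", "bose"];

// Return the first known brand that appears in the normalized name.
function extractBrand(name: string): string {
  return KNOWN_BRANDS.find((b) => name.includes(b)) ?? "";
}

// Try each pattern in order; return the first capture group that matches.
function extractPattern(name: string, patterns: RegExp[]): string {
  for (const p of patterns) {
    const m = name.match(p);
    if (m) return m[1];
  }
  return "";
}
```

The ordering of `genPatterns` matters: explicit markers like "2nd gen" are tried before the bare-trailing-number fallback, so the fallback only fires when nothing stronger matched.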
Layer 2: Fuzzy String Matching
For the remaining 40%, we use Levenshtein distance with a category-aware threshold:
```typescript
function fuzzyMatch(
  a: NormalizedProduct,
  b: NormalizedProduct,
  threshold: number = 0.85
): boolean {
  // Brand must match exactly (after normalization)
  if (a.brand !== b.brand) return false;

  // Compare model family with fuzzy matching
  const similarity = stringSimilarity(a.modelFamily, b.modelFamily);
  if (similarity < threshold) return false;

  // If generations are both present, they must match
  if (a.generation && b.generation && a.generation !== b.generation) {
    return false;
  }

  return true;
}
```
The key insight: brand must match exactly, but the model name can be fuzzy. The threshold matters, though. "Sony WH-1000XM5" and "Sony WF-1000XM5" (over-ear vs in-ear — completely different products) differ by a single character, so the similarity cutoff has to be strict enough to keep near-identical model codes apart.
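The post doesn't show `stringSimilarity`. A common choice, assumed here, is normalized Levenshtein similarity: one minus the edit distance divided by the longer string's length.

```typescript
// Single-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // D[i-1][j-1]
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j]; // D[i-1][j]
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Assumed metric: similarity in [0, 1], 1 meaning identical strings.
function stringSimilarity(a: string, b: string): number {
  if (a === b) return 1;
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```

Note that under this metric "wh-1000xm5" vs "wf-1000xm5" scores 0.9, above a 0.85 cutoff, so short model codes may call for a stricter threshold or a bigram-based metric such as the Dice coefficient.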
Layer 3: Cross-Reference Validation
For edge cases, we validate matches against external canonical sources:
```typescript
async function crossReferenceValidate(
  candidates: NormalizedProduct[]
): Promise<ProductCluster[]> {
  const clusters: ProductCluster[] = [];

  for (const candidate of candidates) {
    // Search for the product on a canonical source
    const canonicalResults = await tavily.search(
      `${candidate.brand} ${candidate.modelFamily} specifications`,
      { searchDepth: "basic", maxResults: 3 }
    );

    // Extract canonical product identifier
    const canonicalId = extractCanonicalId(canonicalResults);

    // Group by canonical ID
    const existing = clusters.find((c) => c.canonicalId === canonicalId);
    if (existing) {
      existing.members.push(candidate);
    } else {
      clusters.push({
        canonicalId,
        canonicalName: candidate.rawName,
        members: [candidate],
      });
    }
  }

  return clusters;
}
```
Handling the Hard Cases
Product Lines vs Individual Products
"Roomba" could mean the brand, the product line, or a specific model (Roomba j7+). We use context clues:
- If a review discusses specific features ("self-emptying base"), it's likely a specific model
- If it's a general comparison ("Roomba vs Roborock"), it's the product line
- We maintain a hierarchy: Brand → Line → Model → Variant
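The Brand → Line → Model → Variant hierarchy can be modeled as a small tree of types. A sketch, with illustrative names rather than the production schema:

```typescript
// Illustrative types only; the production schema is not shown in the post.
interface Variant { id: string; name: string }
interface Model { id: string; name: string; variants: Variant[] }   // "Roomba j7+"
interface Line { id: string; name: string; models: Model[] }        // "Roomba"
interface Brand { id: string; name: string; lines: Line[] }         // "iRobot"

// Resolve a mention to the most specific node the context supports:
// feature-level detail points at a Model, a general comparison at a Line.
function resolveMention(
  brand: Brand,
  mention: string,
  hasFeatureDetail: boolean
): Line | Model | undefined {
  const text = mention.toLowerCase();
  const line = brand.lines.find((l) => text.includes(l.name.toLowerCase()));
  if (!line) return undefined;
  if (!hasFeatureDetail) return line;
  return line.models.find((m) => text.includes(m.name.toLowerCase())) ?? line;
}
```

Falling back to the Line when no specific model matches keeps a mention attached to the right family instead of dropping it.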
Regional Name Differences
The same product sometimes has different names in different markets. The Samsung Galaxy S24 is called "Galaxy S24" everywhere, but some accessories have region-specific names. We maintain an alias table for known cases.
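A minimal alias table can be a map keyed on an aggressively normalized name, so spacing and punctuation variants collapse to the same entry. A sketch, with example entries rather than the real table:

```typescript
// Collapse spacing and punctuation so variants like "Galaxy Buds2 Pro"
// and "Galaxy Buds 2 Pro" share a single lookup key.
function aliasKey(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Example entries only; the real alias table is not shown in the post.
const ALIASES = new Map<string, string>([
  [aliasKey("Galaxy Buds2 Pro"), "Samsung Galaxy Buds 2 Pro"],
  [aliasKey("AirPods Pro USB-C"), "Apple AirPods Pro (2nd Generation)"],
]);

// Return the canonical name for a known alias, else pass the name through.
function canonicalName(raw: string): string {
  return ALIASES.get(aliasKey(raw)) ?? raw;
}
```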
Discontinued vs Current Models
When someone searches "AirPods Pro vs Sony," do they mean the current or previous generation? We default to current unless the query specifies otherwise, but we keep both generations in our database with clear generation markers.
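One way to implement the default-to-current rule, with illustrative data (the real generation lookup is not shown in the post):

```typescript
// Illustrative lookup of each model family's current generation.
const CURRENT_GENERATION: Record<string, string> = {
  "airpods pro": "2",
};

// Use an explicitly stated generation if the query contains one;
// otherwise fall back to the family's current generation.
function resolveGeneration(modelFamily: string, query: string): string {
  const explicit =
    query.match(/(\d+)(?:st|nd|rd|th)?\s*gen(?:eration)?/i) ??
    query.match(/\bgen(?:eration)?\s*(\d+)/i);
  if (explicit) return explicit[1];
  return CURRENT_GENERATION[modelFamily] ?? "";
}
```

For example, "AirPods Pro vs Sony" resolves to the current generation, while "AirPods Pro 1st gen vs Sony" pins the explicit one.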
Performance at Scale
Our entity resolution pipeline processes ~5,000 product mentions daily across all review sources:
| Metric | Value |
|---|---|
| Products in canonical database | 12,000+ |
| Daily new mentions processed | ~5,000 |
| Match accuracy (spot-checked) | 94.2% |
| False positive rate | 1.8% |
| Processing time (full pipeline) | ~12 minutes |
| Most common false positive | Generation confusion (XM4 vs XM5) |
The 1.8% false positive rate is acceptable because our trust score system (covered in Part 5) catches anomalies — if a "product" suddenly has wildly inconsistent ratings, it's likely a merge error.
Lessons Learned
- Don't build ML first. Our three-layer heuristic approach handles 94%+ of cases. ML would marginally improve accuracy but massively increase complexity.
- Brand matching must be exact. Fuzzy brand matching creates catastrophic false positives.
- Generation numbers are treacherous. "AirPods 3" and "AirPods Pro 3" are different products. Always match model family before generation.
- Maintain a manual override table. Some matches are just weird. "Galaxy Buds2 Pro" vs "Galaxy Buds 2 Pro" (note the space) — keep a list of known aliases.
- Log everything. When a match seems wrong in production, you need the matching pipeline's decision trail to diagnose why.
What's Next
We're exploring embedding-based matching for the long tail — products where our heuristics fail because names are too dissimilar. Early experiments with product description embeddings show promise for matching across languages.
See entity resolution in action on aversusb.net — every comparison page unifies data from multiple sources under a single canonical product identity.
Part 8 of our "Building SmartReview" series. Previous: Part 7: People Also Ask Content Discovery