Frendhi
Agentic Entity Resolution for Messy Product Data

TL;DR: E-commerce product data is messy. Traditional aggregation fails because "Philips 9W" and "Lampu Philips 9 Watt" look different to a machine. We built CONGA, an AI agent using Pydantic AI, Gemini 2.0 Flash, and Meilisearch, to intelligently "resolve" these messy titles into clean canonical records, just like a human analyst would, but at scale.

1. Problem

E-commerce product titles are often a mess. They aren't consistent and generally don't follow the brand's official product names. Mapping them to the correct product type and brand is crucial to unlock valuable insights that otherwise can't be derived from raw scraped data.

For instance, if we want to calculate the average price of "Philips LED Bulbs 9W", we can't simply aggregate the raw data because one seller might list it as "Lampu Philips 9W" while another lists it as "Philips LED Bulb 9 Watt". Without normalization, these are treated as two different products, making accurate analysis impossible.

Imagine having these 5 different titles from 5 different sellers:

  1. "Lampu Philips 9W Putih Murah Promo"
  2. "Philips LED Bulb 9 watt Cool Daylight"
  3. "Bohlam LED Philips 9W E27 6500K"
  4. "PHILIPS ESSENTIAL 9W PUTIH"
  5. "Lampu Hemat Energi Philips 9 Watt Original"

To a human, it's obvious these are all the same "Philips Essential LED Bulb 9W". But to a computer? They are 5 unique strings. Connecting these variations to the single master record is the goal.
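The mismatch is easy to demonstrate: to a program these are five distinct strings, and even trivial normalization doesn't collapse them.

```python
# Five listings that a human reads as one product.
titles = [
    "Lampu Philips 9W Putih Murah Promo",
    "Philips LED Bulb 9 watt Cool Daylight",
    "Bohlam LED Philips 9W E27 6500K",
    "PHILIPS ESSENTIAL 9W PUTIH",
    "Lampu Hemat Energi Philips 9 Watt Original",
]

# Exact matching sees five different products.
print(len(set(titles)))  # 5

# Even naive normalization (lowercase + strip) doesn't help.
normalized = {t.lower().strip() for t in titles}
print(len(normalized))  # 5
```

Anything that collapses these reliably has to reason about brand, wattage, and synonyms ("Lampu", "Bohlam", "Bulb"), which is exactly what we delegate to the agent.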

2. Idea

If we have a complete taxonomy of the brand's products, we could just match them one by one, right? Seems like a pretty simple thing to do, given that a brand might only have a few hundred unique products. But when we have thousands of incoming raw data points to map, scaling becomes a problem.

Why can't we just delegate this to an LLM? Again, seems like a pretty simple thing to do, right? And in fact, yes, we can.

Enter CONGA (Commerce Labeling Agent). The idea is to create an agent that can intelligently look up products in our master taxonomy, and if it finds a match, link it. If it doesn't find a match, it should identify it, normalize the name, and "learn" it for next time.

3. Implementation

The Design

The tech stack is straightforward but powerful:

  • Framework: Pydantic AI - This simplifies the agentic workflow A LOT.
  • LLM Provider: Google Vertex AI (specifically Gemini 2.0 Flash for speed and cost-efficiency). Any other provider should work as long as pydantic-ai supports it.
  • Search Engine: Meilisearch - Used to index the master taxonomy for lightning-fast fuzzy searching.
  • Storage: Any storage or database will do; in my case I use a simple generated taxonomy store (SQLite/JSON) to hold the "new" products the agent finds.

One of the key components is the create_agent function which initializes the agent with the system prompt and tools:

from typing import Optional
from collections.abc import Sequence

from pydantic_ai import Agent
from pydantic_ai.models.google import GoogleModel
from pydantic_ai.toolsets import AbstractToolset

# Project-local imports elided: AgentDependencies, LabelingResult,
# load_system_prompt, build_env_fastmcp_toolsets, and the tool functions.

def create_agent(
    system_prompt: Optional[str] = None,
    extra_toolsets: Optional[Sequence[AbstractToolset[AgentDependencies]]] = None,
    model_name: str = 'gemini-2.0-flash',
    verbose: bool = False
) -> Agent:
    """Create CONGA agent with custom or default system prompt."""
    prompt = system_prompt or load_system_prompt()

    # Initialize Google model
    model = GoogleModel(model_name)

    toolsets: list[AbstractToolset[AgentDependencies]] = []
    toolsets.extend(build_env_fastmcp_toolsets())
    if extra_toolsets:
        toolsets.extend(extra_toolsets)

    return Agent(
        model,
        output_type=LabelingResult,
        system_prompt=prompt,
        deps_type=AgentDependencies,
        tools=[search_taxonomy, search_generated_taxonomy, get_taxonomy_context],
        toolsets=toolsets
    )

And here is how we implement the search_taxonomy tool that allows the agent to query our Meilisearch index:

def search_taxonomy(ctx: RunContext[AgentDependencies], query: str, category: Optional[str] = None) -> str:
    """Search for products in the predefined taxonomy."""
    print(f"  πŸ” Searching taxonomy for: '{query}'" + (f" (category: {category})" if category else ""))

    if ctx.deps.use_meilisearch and ctx.deps.meilisearch_tool:
        # Use Meilisearch
        matches = ctx.deps.meilisearch_tool.search(query, category=category)

        if not matches:
            return "No matches found in taxonomy."

        result = f"Found {len(matches)} matches in taxonomy:\n"
        for match in matches:
            result += f"- Normalized Name: {match.get_normalized_name()}\n"
            result += f"  Brand: {match.brand}\n"
            result += f"  Category: {match.category}\n"
            if match.sku_type_complete:
                result += f"  SKU Type: {match.sku_type_complete}\n"
            result += "\n"
    else:
        # Fallback to simple search (elided here). Note that `result` must
        # still be assigned on this path, or the return below raises NameError.
        result = "No matches found in taxonomy."

    return result
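The fallback branch is elided above. A minimal sketch of what it could look like, using Python's stdlib `difflib` for fuzzy matching in place of Meilisearch (the in-memory taxonomy and field names here are illustrative, not the project's actual schema):

```python
import difflib

# Illustrative in-memory taxonomy; the real project loads this from storage.
TAXONOMY = [
    {"product_name": "Philips Essential LED Bulb 9 W", "brand": "Philips", "category": "LED Lamps"},
    {"product_name": "Philips RadiantLine LEDBulb 5W 6500K", "brand": "Philips", "category": "LED Lamps"},
]

def simple_search(query: str, limit: int = 3) -> str:
    """Fuzzy-match the query against taxonomy product names with difflib."""
    names = [entry["product_name"] for entry in TAXONOMY]
    hits = difflib.get_close_matches(query, names, n=limit, cutoff=0.4)
    if not hits:
        return "No matches found in taxonomy."
    lines = [f"Found {len(hits)} matches in taxonomy:"]
    for name in hits:
        entry = next(e for e in TAXONOMY if e["product_name"] == name)
        lines.append(f"- Normalized Name: {name}")
        lines.append(f"  Brand: {entry['brand']}")
        lines.append(f"  Category: {entry['category']}")
    return "\n".join(lines)

print(simple_search("Philips Essential LED 9W"))
```

The point is that the tool's contract stays the same either way: the agent only ever sees a formatted text summary of the matches.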

How it Works

The agent is given a specific goal: map the raw product title to the master taxonomy. We give it two main tools:

  1. search_taxonomy: To query the official Meilisearch index.
  2. search_generated_taxonomy: To check if we've already seen this "new" product before.

The flow is:

  1. Analyze: The agent reads the raw title.
  2. Search Master: It queries the Meilisearch index.
  3. Decision:
    • If a high-confidence match (>0.7) is found, use it.
    • If NOT found, check the "generated" storage.
    • If still not found, create a new label and flag it as source="new".
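Stripped of the LLM, that flow is a three-way fallback. Here is a plain-Python sketch of the control logic, using `difflib` as a stand-in for Meilisearch and plain dicts as stand-ins for the master index and the generated store (all names here are illustrative, not CONGA's actual code):

```python
import difflib

def best_match(query: str, store: dict):
    """Return the closest key in `store` and its similarity score."""
    hits = difflib.get_close_matches(query, list(store), n=1, cutoff=0.5)
    if not hits:
        return None, 0.0
    return hits[0], difflib.SequenceMatcher(None, query, hits[0]).ratio()

def resolve(raw_title: str, master: dict, generated: dict) -> dict:
    """Three-way fallback: master taxonomy -> generated store -> new label."""
    # 1. Search the master taxonomy first.
    name, score = best_match(raw_title, master)
    if name and score > 0.7:
        return {"normalized_name": name, "source": "taxonomy", "confidence": score}

    # 2. Check whether we've already normalized this product before.
    name, score = best_match(raw_title, generated)
    if name:
        return {"normalized_name": name, "source": "generated", "confidence": score}

    # 3. Nothing matched: create a new label and remember it for next time.
    new_name = raw_title.title()  # placeholder for the LLM's normalization
    generated[new_name] = {"original": raw_title}
    return {"normalized_name": new_name, "source": "new", "confidence": 0.4}
```

What the agent adds on top of this skeleton is judgment: deciding which queries to try, when a fuzzy hit really is the same product, and how to normalize a genuinely new one.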

The Challenge: Prompt Engineering

Since the tech stack is pretty simple, what's the challenge? It's the System Prompt. This turned out to be more of a prompt engineering challenge than a complex ML Ops problem.

The key was telling the LLM exactly how to search. We couldn't just say "find this product." We had to give it a Search Strategy.

In the final system prompt, we explicitly instructed it to try multiple queries in order:

  1. FIRST TRY: Brand + Product Line + Wattage (e.g., "Philips Essential LED 9W") - Most specific.
  2. IF NO MATCH: Brand + Wattage only (e.g., "Philips 9W") - Catches variations.
  3. IF NO MATCH: Brand + Product Type (e.g., "Philips LED Bulb") - Broad search.
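That ladder is easy to express in code. A sketch of how the fallback queries could be generated from extracted attributes (the function and parameter names are illustrative; in CONGA the LLM performs this reasoning itself, guided only by the prompt):

```python
def build_search_ladder(brand, product_line=None, wattage=None, product_type=None):
    """Generate taxonomy search queries from most to least specific."""
    queries = []
    if product_line and wattage:
        queries.append(f"{brand} {product_line} {wattage}")   # most specific
    if wattage:
        queries.append(f"{brand} {wattage}")                  # catches naming variations
    if product_type:
        queries.append(f"{brand} {product_type}")             # broad search
    queries.append(brand)                                     # last resort
    return queries

print(build_search_ladder("Philips", "Essential LED", "9W", "LED Bulb"))
# ['Philips Essential LED 9W', 'Philips 9W', 'Philips LED Bulb', 'Philips']
```

Encoding the ladder in the prompt rather than in code keeps the agent flexible: it can skip or reorder rungs when the title gives it a better signal.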

We also added Critical Decision Logic:

"If NOT lighting (e.g., smartphone, laptop), skip taxonomy search β†’ source = 'new'"

By forcing the agent to be persistent and systematic, the match rate improved dramatically.

That alone is a pretty good result, but we can do better if we have a clear understanding of the category's nuances. For lighting products, most titles carry attributes like power, color temperature, and color as keywords. Adding this context to the system prompt helps the agent decide how to search for each product.
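For lighting specifically, some of those attributes are regular enough to extract mechanically. A rough regex sketch (in CONGA the LLM does this extraction itself; this is just to illustrate the kind of structure the prompt encodes):

```python
import re

def extract_lighting_attrs(title: str) -> dict:
    """Pull wattage and color temperature out of a raw lighting title."""
    watt = re.search(r"(\d+(?:\.\d+)?)\s*w(?:att)?\b", title, re.IGNORECASE)
    kelvin = re.search(r"(\d{4})\s*k\b", title, re.IGNORECASE)
    return {
        "wattage": f"{watt.group(1)}W" if watt else None,
        "color_temp": f"{kelvin.group(1)}K" if kelvin else None,
    }

print(extract_lighting_attrs("Bohlam LED Philips 9W E27 6500K"))
# {'wattage': '9W', 'color_temp': '6500K'}
```

The regexes normalize "9W", "9 watt", and "9 Watt" to the same token, which is exactly the variation that breaks naive aggregation.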

4. The Data

To really test this, we need two things: a source of truth (Taxonomy) and a mess of data (Input).

The Taxonomy (Source of Truth)

This is what our "Golden Record" looks like. It's a clean list of products with their attributes.

| Product Name | Brand | Category | Attributes |
| --- | --- | --- | --- |
| Philips RadiantLine LEDBulb 5W 6500K | Philips | LED Bulb | 5W, 6500K |
| Philips Essential LEDBulb 9W 6500K | Philips | LED Bulb | 9W, 6500K |

The Input (The Mess)

And this is what we're trying to fix. Real-world raw titles from scraping:

| Raw Input Title |
| --- |
| Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips |
| PHILIPS LED BULB 7W RADIANTLINE 3000K WARM WHITE LAMPU HEMAT ENERGI |
| Samsung Galaxy A54 5G Ram 8/256 GB Garansi Resmi SEIN Indonesia |
| Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik |
| iPhone 14 Pro Max 256GB Deep Purple Apple Garansi iBox |

5. Result

The results are surprisingly good; I'm actually satisfied with them. Modern LLMs are really good at this kind of logic-heavy text processing.

The agent returns a structured LabelingResult like this:

{
  "original_title": "Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips",
  "normalized_name": "Philips RadiantLine LEDBulb 5W 6500K",
  "brand": "Philips",
  "category": "LED Lamps",
  "confidence": 0.95,
  "source": "taxonomy"
}
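For reference, a plausible shape for the `LabelingResult` model, inferred from the fields in that JSON and in the system prompt (the project's real model may differ):

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

class LabelingResult(BaseModel):
    """Structured output the agent must return.

    Fields inferred from the sample output; sku_type_complete comes
    from the system prompt's taxonomy rules.
    """
    original_title: str
    normalized_name: str
    brand: str
    category: str
    confidence: float = Field(ge=0.0, le=1.0)
    source: Literal["taxonomy", "generated", "new"]
    sku_type_complete: Optional[str] = None
```

Because this class is passed as `output_type`, Pydantic AI forces the model to emit JSON matching the schema, and `confidence` is validated to stay within [0, 1].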

Real-World Test

Here is a snippet from our actual experiment log showing the agent in action. Notice how it finds high-confidence matches in the taxonomy for known products (Hannochs, Avaro) but correctly identifies a new product (Philips LED Strip) when the exact SKU isn't in the master record.

[1/100] Processing: Hannochs Lampu Bohlam LED Vario 45W Cahaya Kuning
  πŸ” Searching taxonomy for: 'Hannochs LED 45W' (category: LED Lamps)
βœ“ SUCCESS
  Normalized Name: Hannochs LED Vario 45 W
  Confidence:      90.00%
  Source:          taxonomy

[2/100] Processing: [Smart ] Avaro Smart Led Light Bulb Rgbww 10W Bluetooth Wireless Iot [Lamp]
  πŸ” Searching taxonomy for: 'Avaro Smart Led Light Bulb 10W'
βœ“ SUCCESS
  Normalized Name: Avaro Smart LED Bluetooth 10 W
  Confidence:      90.00%
  Source:          taxonomy

[3/100] Processing: TERBARU PHILIPS SMART WIFI LED STRIP STARTER KIT 2M - COLOR & TUNABLE T0306
  πŸ” Searching taxonomy for: 'Philips LED strip 2M'
βœ“ SUCCESS
  Normalized Name: Philips Smart Wi-Fi LED Strip 2M
  Confidence:      60.00%
  Source:          new

Here is a sample of how it handles different scenarios:

| Raw Input Title | Normalized Name | Brand | Source | Confidence |
| --- | --- | --- | --- | --- |
| Lampu Philips LED 9W Putih Murah | Philips Essential LED Bulb 9 W | Philips | taxonomy | 0.95 |
| Hannochs 10 watt cahaya putih | Hannochs LED Bulb 10 W | Hannochs | generated | 0.82 |
| Smart LED Bulb Bardi 12W RGBWW | Bardi Smart LED Bulb 12W RGBWW | Bardi | new | 0.45 |
| iPhone 15 Pro Max 256GB | iPhone 15 Pro Max 256GB | Apple | new | 0.10 |
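In practice, the confidence field is what makes this output operational: anything below a threshold can be routed to human review instead of being trusted blindly. A minimal sketch (the threshold value is an assumption, not part of CONGA):

```python
REVIEW_THRESHOLD = 0.5  # illustrative cutoff, tune per use case

results = [
    {"normalized_name": "Philips Essential LED Bulb 9 W", "source": "taxonomy", "confidence": 0.95},
    {"normalized_name": "Bardi Smart LED Bulb 12W RGBWW", "source": "new", "confidence": 0.45},
    {"normalized_name": "iPhone 15 Pro Max 256GB", "source": "new", "confidence": 0.10},
]

# Split into auto-accepted labels and ones needing a human look.
auto_accepted = [r for r in results if r["confidence"] >= REVIEW_THRESHOLD]
needs_review = [r for r in results if r["confidence"] < REVIEW_THRESHOLD]

print(len(auto_accepted), len(needs_review))  # 1 2
```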

And the cost? Not concerning at all with Gemini Flash. It's a pretty simple solution for a pretty simple job, but it solves a real data quality headache.

Here's the system prompt:

You are CONGA (Commerce Labeling Agent), an AI specialized in labeling and normalizing product titles from e-commerce platforms.

IMPORTANT: The taxonomy database is FOCUSED ON LIGHTING PRODUCTS ONLY (LED bulbs, lamps, etc.).
If a product is NOT a lighting product (e.g., smartphones, laptops, etc.), it will NOT be in the taxonomy.
For non-lighting products, you should create a new label (source = "new").

Your task is to:
1. Analyze product title provided
2. Search predefined taxonomy for matches FIRST (only for lighting products)
3. If a good match is found in taxonomy (confidence > 0.7), use it and STOP - do NOT search generated taxonomy
4. Only search generated taxonomy if NO good taxonomy match is found
5. Extract and normalize product information

TAXONOMY DATA STRUCTURE:
When you search the taxonomy, each entry has these fields:
- brand: e.g., "Philips"
- category: e.g., "LED Lamps"
- product_name: e.g., "Philips Essential LED Bulb 9 W" (THIS IS THE NORMALIZED SKU NAME - use this!)
- sub_brand: e.g., "Philips Essential LED Bulb"

Example taxonomy entry:
{
  "brand": "Philips",
  "category": "LED Lamps",
  "product_name": "Philips Essential LED Bulb 9 W",
  "sub_brand": "Philips Essential LED Bulb"
}

Note: sku_type_complete is the authoritative normalized product name from the taxonomy.

SEARCH STRATEGY - TRY MULTIPLE APPROACHES:
The taxonomy search is fuzzy and powerful. You MUST try multiple search queries before giving up.

For a product like "Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik":

1. FIRST TRY: Brand + Product Line + Wattage
   - Search: "Philips Essential LED 9W"
   - This is the most specific and usually works best

2. IF NO MATCH: Brand + Wattage only
   - Search: "Philips 9W"
   - Catches products where product line name varies

3. IF NO MATCH: Brand + Product Type
   - Search: "Philips LED Bulb"
   - Broader search to find any similar products

4. IF NO MATCH: Just Brand
   - Search: "Philips"
   - See what products exist from this brand

IMPORTANT RULES:
- Try AT LEAST 2-3 different search queries before giving up on taxonomy
- DO NOT add category filters to search queries - let Meilisearch handle relevance
- Extract the sku_type_complete EXACTLY as returned from taxonomy
- Only mark as "new" if you've tried multiple searches and found nothing relevant

Guidelines:
- Extract brand names accurately regardless of industry
- Identify correct product category based on product characteristics
- Normalize product names to be clean, consistent, and professional
- Preserve key technical specifications (wattage, model numbers, sizes, etc.)
- For non-lighting products, set source = "new" since they won't be in the lighting taxonomy

CRITICAL DECISION LOGIC:
1. Identify if the product is a lighting product:
   - If NOT lighting (e.g., smartphone, laptop), skip taxonomy search β†’ source = "new"
   - If lighting product, proceed to search taxonomy

2. Search taxonomy PERSISTENTLY (for lighting products only):
   - Try MULTIPLE search queries (see SEARCH STRATEGY above)
   - Start specific, then broaden: brand+line+wattage β†’ brand+wattage β†’ brand+type β†’ brand only
   - If ANY search finds a good match (confidence >= 0.7):
     * Use the taxonomy result
     * Set source = "taxonomy"
     * Set sku_type_complete = EXACTLY the "sku_type_complete" value from the taxonomy entry
     * DO NOT search generated taxonomy
   - Only proceed to step 3 if ALL search attempts fail

3. If taxonomy has NO good match after multiple tries:
   - Search generated taxonomy
   - If found, set source = "generated"
   - Set sku_type_complete = None (generated taxonomy doesn't have SKU types)

4. If neither has a match:
   - Create a new label
   - Set source = "new"
   - Set sku_type_complete = None

REMEMBER: For lighting products, you should try AT LEAST 2-3 different taxonomy searches before giving up!

Confidence Scoring Rules:
- High confidence (>0.8) when matching existing taxonomy
- Medium confidence (0.5-0.8) when matching generated taxonomy
- Lower confidence (<0.5) when creating new labels

Source Classification:
- "taxonomy" when using predefined taxonomy (always prefer this if a good match exists)
- "generated" when using generated taxonomy (only if no good taxonomy match)
- "new" when creating new labels (only if neither taxonomy nor generated has a match)

Output Requirements:
Always return a complete LabelingResult with all fields filled.
The source field MUST accurately reflect where the data came from.

Disclaimer: This is not traditional NER (Named Entity Recognition). Traditional NER models extract entities (like "Philips" or "9W") but don't inherently link them to a master record. This approach leverages the reasoning power of LLMs to mimic a human data analyst: searching, comparing, and making a decision to map the messy input to a canonical entry (Entity Resolution).
