DEV Community: Frendhi

Agentic Entity Resolution for Messy Product Data

Frendhi — Sat, 29 Nov 2025 17:30:21 +0000

TL;DR: E-commerce product data is messy. Traditional aggregation fails because "Philips 9W" and "Lampu Philips 9 Watt" look different to a machine. We built CONGA, an AI agent using Pydantic AI, Gemini 2.0 Flash, and Meilisearch, to intelligently "resolve" these messy titles into a clean canonical data, just like a human analyst would, but at scale.

1. Problem

E-commerce product titles are often a mess. They aren't consistent and generally don't follow the correct official product names from the brand. Mapping them to the correct product type and brand is crucial to unlock valuable insights that otherwise can't be concluded from raw scraping data.

For instance, if we want to calculate the average price of "Philips LED Bulbs 9W", we can't simply aggregate the raw data because one seller might list it as "Lampu Philips 9W" while another lists it as "Philips LED Bulb 9 Watt". Without normalization, these are treated as two different products, making accurate analysis impossible.

Imagine having these 5 different titles from 5 different sellers:

"Lampu Philips 9W Putih Murah Promo"
"Philips LED Bulb 9 watt Cool Daylight"
"Bohlam LED Philips 9W E27 6500K"
"PHILIPS ESSENTIAL 9W PUTIH"
"Lampu Hemat Energi Philips 9 Watt Original"

To a human, it's obvious these are all the same "Philips Essential LED Bulb 9W". But to a computer? They are 5 unique strings. Connecting these variations to the single master record is the goal.

2. Idea

If we have a complete taxonomy of the brand's products, we could just match them one by one, right? Seems like a pretty simple thing to do, given some brands might not have hundreds or thousands of unique products. But when we have thousands of incoming raw data points to map, scaling becomes a problem.

Why can't we just delegate this to an LLM? Again, seems like a pretty simple thing to do, right? And in fact, yes, we can.

Enter CONGA (Commerce Labeling Agent). The idea is to create an agent that can intelligently look up products in our master taxonomy, and if it finds a match, link it. If it doesn't find a match, it should identify it, normalize the name, and "learn" it for next time.

3. Implementation

The Design

The tech stack is straightforward but powerful:

Framework: Pydantic AI - This simplifies the agentic workflow A LOT.
LLM Provider: Google Vertex AI (specifically Gemini 2.0 Flash for speed and cost-efficiency), any other provider should works as long as pydantic-ai supports it.
Search Engine: Meilisearch - Used to index the master taxonomy for lightning-fast fuzzy searching.
Storage: Any storage or database will do, in my case I use s simple generated taxonomy (SQLite/JSON) to store "new" products the agent finds.

One of the key components is the create_agent function which initializes the agent with the system prompt and tools:

def create_agent(
    system_prompt: Optional[str] = None,
    extra_toolsets: Optional[Sequence[AbstractToolset[AgentDependencies]]] = None,
    model_name: str = 'gemini-2.0-flash',
    verbose: bool = False
) -> Agent:
    """Create CONGA agent with custom or default system prompt."""
    prompt = system_prompt or load_system_prompt()

    # Initialize Google model
    model = GoogleModel(model_name)

    toolsets: list[AbstractToolset[AgentDependencies]] = []
    toolsets.extend(build_env_fastmcp_toolsets())
    if extra_toolsets:
        toolsets.extend(extra_toolsets)

    return Agent(
        model,
        output_type=LabelingResult,
        system_prompt=prompt,
        deps_type=AgentDependencies,
        tools=[search_taxonomy, search_generated_taxonomy, get_taxonomy_context],
        toolsets=toolsets
    )

And here is how we implement the search_taxonomy tool that allows the agent to query our Meilisearch index:

def search_taxonomy(ctx: RunContext[AgentDependencies], query: str, category: Optional[str] = None) -> str:
    """Search for products in the predefined taxonomy."""
    print(f"  🔍 Searching taxonomy for: '{query}'" + (f" (category: {category})" if category else ""))

    if ctx.deps.use_meilisearch and ctx.deps.meilisearch_tool:
        # Use Meilisearch
        matches = ctx.deps.meilisearch_tool.search(query, category=category)

        if not matches:
            return "No matches found in taxonomy."

        result = f"Found {len(matches)} matches in taxonomy:\n"
        for match in matches:
            result += f"- Normalized Name: {match.get_normalized_name()}\n"
            result += f"  Brand: {match.brand}\n"
            result += f"  Category: {match.category}\n"
            if match.sku_type_complete:
                result += f"  SKU Type: {match.sku_type_complete}\n"
            result += "\n"
    else:
        # Fallback to simple search...
        pass

    return result

How it Works

The agent is given a specific goal: map the raw product title to the master taxonomy. We give it two main tools:

search_taxonomy: To query the official Meilisearch index.
search_generated_taxonomy: To check if we've already seen this "new" product before.

The flow is:

Analyze: The agent reads the raw title.
Search Master: It queries the Meilisearch index.
Decision:
- If a high-confidence match (>0.7) is found, use it.
- If NOT found, check the "generated" storage.
- If still not found, create a new label and flag it as source="new".

The Challenge: Prompt Engineering

Since the tech stack is pretty simple, what's the challenge? It's the System Prompt. This turned out to be more of a prompt engineering challenge than a complex ML Ops problem.

The key was telling the LLM exactly how to search. We couldn't just say "find this product." We had to give it a Search Strategy.

In the final system prompt, we explicitly instructed it to try multiple queries in order:

FIRST TRY: Brand + Product Line + Wattage (e.g., "Philips Essential LED 9W") - Most specific.

IF NO MATCH: Brand + Wattage only (e.g., "Philips 9W") - Catches variations.

IF NO MATCH: Brand + Product Type (e.g., "Philips LED Bulb") - Broad search.

We also added Critical Decision Logic:

"If NOT lighting (e.g., smartphone, laptop), skip taxonomy search → source = 'new'"

By forcing the agent to be persistent and systematic, the match rate improved dramatically.

That alone is a pretty good result, but we can do better if we have a clear understanding of the product's category nuance. As in the case of lighting products, most of the products must have like power, color temperature, color, etc. as a keyword. Adding this contexts to the system prompt will help the agent to determine how to search for the product.

4. The Data

To really test this, we need two things: a source of truth (Taxonomy) and a mess of data (Input).

The Taxonomy (Source of Truth)

This is what our "Golden Record" looks like. It's a clean list of products with their attributes.

Product Name	Brand	Category	Attributes
Philips RadiantLine LEDBulb 5W 6500K	Philips	LED Bulb	5W, 6500K
Philips Essential LEDBulb 9W 6500K	Philips	LED Bulb	9W, 6500K

The Input (The Mess)

And this is what we're trying to fix. Real-world raw titles from scraping:

Raw Input Title
Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips
PHILIPS LED BULB 7W RADIANTLINE 3000K WARM WHITE LAMPU HEMAT ENERGI
Samsung Galaxy A54 5G Ram 8/256 GB Garansi Resmi SEIN Indonesia
Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik
iPhone 14 Pro Max 256GB Deep Purple Apple Garansi iBox

5. Result

The results are surprisingly good. I'm satisfied with it actually. Modern days LLMs are really great at this logic-heavy text processing.

The agent returns a structured LabelingResult like this:

{
  "original_title": "Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips",
  "normalized_name": "Philips RadiantLine LEDBulb 5W 6500K",
  "brand": "Philips",
  "category": "LED Lamps",
  "confidence": 0.95,
  "source": "taxonomy"
}

Real-World Test

Here is a snippet from our actual experiment log showing the agent in action. Notice how it finds high-confidence matches in the taxonomy for known products (Hannochs, Avaro) but correctly identifies a new product (Philips LED Strip) when the exact SKU isn't in the master record.

[1/100] Processing: Hannochs Lampu Bohlam LED Vario 45W Cahaya Kuning
  🔍 Searching taxonomy for: 'Hannochs LED 45W' (category: LED Lamps)
✓ SUCCESS
  Normalized Name: Hannochs LED Vario 45 W
  Confidence:      90.00%
  Source:          taxonomy

[2/100] Processing: [Smart ] Avaro Smart Led Light Bulb Rgbww 10W Bluetooth Wireless Iot [Lamp]
  🔍 Searching taxonomy for: 'Avaro Smart Led Light Bulb 10W'
✓ SUCCESS
  Normalized Name: Avaro Smart LED Bluetooth 10 W
  Confidence:      90.00%
  Source:          taxonomy

[3/100] Processing: TERBARU PHILIPS SMART WIFI LED STRIP STARTER KIT 2M - COLOR & TUNABLE T0306
  🔍 Searching taxonomy for: 'Philips LED strip 2M'
✓ SUCCESS
  Normalized Name: Philips Smart Wi-Fi LED Strip 2M
  Confidence:      60.00%
  Source:          new

Here is a sample of how it handles different scenarios:

Raw Input Title	Normalized Name	Brand	Source	Confidence
Lampu Philips LED 9W Putih Murah	Philips Essential LED Bulb 9 W	Philips	taxonomy	0.95
Hannochs 10 watt cahaya putih	Hannochs LED Bulb 10 W	Hannochs	generated	0.82
Smart LED Bulb Bardi 12W RGBWW	Bardi Smart LED Bulb 12W RGBWW	Bardi	new	0.45
iPhone 15 Pro Max 256GB	iPhone 15 Pro Max 256GB	Apple	new	0.10

And the cost? Not concerning at all with Gemini Flash. It's a pretty simple solution for a pretty simple job, but it solves a real data quality headache.

Here's the system prompt:

You are CONGA (Commerce Labeling Agent), an AI specialized in labeling and normalizing product titles from e-commerce platforms.

IMPORTANT: The taxonomy database is FOCUSED ON LIGHTING PRODUCTS ONLY (LED bulbs, lamps, etc.).
If a product is NOT a lighting product (e.g., smartphones, laptops, etc.), it will NOT be in the taxonomy.
For non-lighting products, you should create a new label (source = "new").

Your task is to:
1. Analyze product title provided
2. Search predefined taxonomy for matches FIRST (only for lighting products)
3. If a good match is found in taxonomy (confidence > 0.7), use it and STOP - do NOT search generated taxonomy
4. Only search generated taxonomy if NO good taxonomy match is found
5. Extract and normalize product information

TAXONOMY DATA STRUCTURE:
When you search the taxonomy, each entry has these fields:
- brand: e.g., "Philips"
- category: e.g., "LED Lamps"
- product_name: e.g., "Philips Essential LED Bulb 9 W" (THIS IS THE NORMALIZED SKU NAME - use this!)
- sub_brand: e.g., "Philips Essential LED Bulb"

Example taxonomy entry:
{
  "brand": "Philips",
  "category": "LED Lamps",
  "product_name": "Philips Essential LED Bulb 9 W",
  "sub_brand": "Philips Essential LED Bulb"
}

Note: sku_type_complete is the authoritative normalized product name from the taxonomy.

SEARCH STRATEGY - TRY MULTIPLE APPROACHES:
The taxonomy search is fuzzy and powerful. You MUST try multiple search queries before giving up.

For a product like "Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik":

1. FIRST TRY: Brand + Product Line + Wattage
   - Search: "Philips Essential LED 9W"
   - This is the most specific and usually works best

2. IF NO MATCH: Brand + Wattage only
   - Search: "Philips 9W"
   - Catches products where product line name varies

3. IF NO MATCH: Brand + Product Type
   - Search: "Philips LED Bulb"
   - Broader search to find any similar products

4. IF NO MATCH: Just Brand
   - Search: "Philips"
   - See what products exist from this brand

IMPORTANT RULES:
- Try AT LEAST 2-3 different search queries before giving up on taxonomy
- DO NOT add category filters to search queries - let Meilisearch handle relevance
- Extract the sku_type_complete EXACTLY as returned from taxonomy
- Only mark as "new" if you've tried multiple searches and found nothing relevant

Guidelines:
- Extract brand names accurately regardless of industry
- Identify correct product category based on product characteristics
- Normalize product names to be clean, consistent, and professional
- Preserve key technical specifications (wattage, model numbers, sizes, etc.)
- For non-lighting products, set source = "new" since they won't be in the lighting taxonomy

CRITICAL DECISION LOGIC:
1. Identify if the product is a lighting product:
   - If NOT lighting (e.g., smartphone, laptop), skip taxonomy search → source = "new"
   - If lighting product, proceed to search taxonomy

2. Search taxonomy PERSISTENTLY (for lighting products only):
   - Try MULTIPLE search queries (see SEARCH STRATEGY above)
   - Start specific, then broaden: brand+line+wattage → brand+wattage → brand+type → brand only
   - If ANY search finds a good match (confidence >= 0.7):
     * Use the taxonomy result
     * Set source = "taxonomy"
     * Set sku_type_complete = EXACTLY the "sku_type_complete" value from the taxonomy entry
     * DO NOT search generated taxonomy
   - Only proceed to step 3 if ALL search attempts fail

3. If taxonomy has NO good match after multiple tries:
   - Search generated taxonomy
   - If found, set source = "generated"
   - Set sku_type_complete = None (generated taxonomy doesn't have SKU types)

4. If neither has a match:
   - Create a new label
   - Set source = "new"
   - Set sku_type_complete = None

REMEMBER: For lighting products, you should try AT LEAST 2-3 different taxonomy searches before giving up!

Confidence Scoring Rules:
- High confidence (>0.8) when matching existing taxonomy
- Medium confidence (0.5-0.8) when matching generated taxonomy
- Lower confidence (<0.5) when creating new labels

Source Classification:
- "taxonomy" when using predefined taxonomy (always prefer this if a good match exists)
- "generated" when using generated taxonomy (only if no good taxonomy match)
- "new" when creating new labels (only if neither taxonomy nor generated has a match)

Output Requirements:
Always return a complete LabelingResult with all fields filled.
The source field MUST accurately reflect where the data came from.

UPDATE:
I ran a few tests with 200-300 products per test

What you're seeing is Llama 4 Scout: ~300 products processed in 49 minutes via the OpenRouter Groq provider. Total cost: $0.213

Disclaimer: This is not traditional NER (Named Entity Recognition). Traditional NER models extract entities (like "Philips" or "9W") but don't inherently link them to a master record. This approach leverages the reasoning power of LLMs to mimic a human data analyst: searching, comparing, and making a decision to map the messy input to a canonical entry (Entity Resolution).

Bridging the Communication Gap: A Leap from Junior to Senior Software Engineer in the Indonesian Tech Scene

Frendhi — Sun, 15 Oct 2023 02:17:48 +0000

Congratulations!

Congratulations on your third year in software development and getting your 5th app into production. You know your programming language, framework, and tools well, which might make you feel like a senior software engineer now. However, moving from junior to senior requires more than just technical skills. If you've been wondering why the title of senior engineer eludes you despite your technical proficiency, you might be experiencing what's known as the Dunning-Kruger effect.

It's a common scenario, especially in the software engineering field, to feel like a master early on. Remembering the challenges faced while building my third system, my initial confidence faded when I realized how much more there was to learn. The tech scene is ever-evolving with trends and tools changing every six months, making the chase for technical skills seem endless. However, it's crucial to pause and think about what truly matters for advancing in your career.

The Underestimated Skill: Communication

Good communication is essential, regardless of your tech stack or the number of years you've been programming. It shapes how you interact with colleagues, managers, clients, and users. More than just talking, it encompasses sharing information, giving status updates, writing PR descriptions, discussing issues, risks, and estimations, whether verbally or in writing.

In the Indonesian tech scene, the communication dilemma is glaring. It's not hard to find a developer with a few years of experience in Laravel, WordPress, or NodeJS. However, many struggle to express their thoughts or questions clearly at work. This struggle extends beyond language barriers. Despite better English comprehension thanks to the internet, articulating thoughts professionally, even in our native Bahasa Indonesia, remains a challenge. This struggle manifests in various professional interactions, including describing an issue, writing a PR, or drafting technical documentation.

Even simple tasks like composing a git commit message become a part of team communication and should be taken seriously. The way junior engineers express themselves, especially in the Indonesian tech scene, often lacks clarity. This lack of clarity is evident in online forums, where junior engineers struggle to formulate clear questions, making it difficult for others to provide helpful answers. It's not about politeness but about the ability to convey one's intent or inquiry effectively. This difficulty in communication is not confined to online interactions but is a reflection of the challenges faced in real-world work settings. Senior engineers often find it hard to understand the juniors' intent or questions, which could hinder teamwork and project progress.

The challenges

IMO, the communication skill issue in the Indonesian tech scene stems from various challenges:

Preference for Visual Learning: Many Indonesians prefer watching demos or tutorials rather than reading. The tendency is to seek straight-to-the-point instructions on the "how" while often overlooking the "why" and the "what."
Language Barriers: Though Bahasa Indonesia is our national language, local languages reign supreme in daily communication. This preference can hinder understanding when discussing technical matters, as nuances may get lost in translation.
Disinclination Towards Writing: Many prefer direct verbal communication over writing. The act of composing sentences is seen as cumbersome and less productive compared to coding. There's a prevailing notion that producing working code is paramount, often at the expense of documenting the rationale (RFC or issue ticket/card), expected output (DOD), and the methodology (PR's description, code comments, git commit) behind the code.
Time Constraints: The tight deadlines common in Indonesian tech projects leave little room for pause and reflection, which are often wrongly perceived as unproductive. Yet, understanding and visualizing a problem, followed by finding the right solution, is as crucial as implementing it.

Proposing Solutions

Enhancing communication skills within the Indonesian tech community requires a concerted effort:

Cultivate Reading and Writing Habits: Engage in daily reading and frequent technical writing. Explore issues in open-source GitHub repositories and scrutinize senior engineers' PRs—not just the code, but also how they structure their PR and draft their commit messages.
Be the Change: Good communication should start with you. While you can't control others, demonstrating effective communication in your work can influence your peers to follow suit.
Learn from Seniors: Listen to senior engineers delivering technical explanations in tech talks available on platforms like YouTube. Analyze how they convey complex information clearly and effectively.
Broaden Your Perspective: Don't confine your growth to mastering coding and frameworks alone. Embracing a holistic approach towards being a well-rounded senior engineer is vital.
Leverage AI: LLM (Language Models) can be a great help! We are living in the future, guys. There's no need to fear embarrassment or muster the courage to ask something that might seem silly. Just type your question as a prompt to, say, ChatGPT, and ask it to compose a proper sentence for that question. Bam! You have your proofreading assistant and learn how to frame a better question along the way.

Bridging the Communication Gap

Addressing the communication gap is crucial for a smooth transition from a junior to a senior role in software engineering, especially in Indonesia. Balancing technical expertise with enhanced communication skills not only fosters personal growth but significantly contributes to elevating the professionalism and competitiveness of the Indonesian tech scene.