TL;DR: E-commerce product data is messy. Traditional aggregation fails because "Philips 9W" and "Lampu Philips 9 Watt" look different to a machine. We built CONGA, an AI agent using Pydantic AI, Gemini 2.0 Flash, and Meilisearch, to intelligently "resolve" these messy titles into clean canonical records, just like a human analyst would, but at scale.
1. Problem
E-commerce product titles are often a mess. They're inconsistent and rarely follow the brand's official product names. Mapping them to the correct product type and brand is crucial to unlock valuable insights that can't be drawn from raw scraped data.
For instance, if we want to calculate the average price of "Philips LED Bulbs 9W", we can't simply aggregate the raw data because one seller might list it as "Lampu Philips 9W" while another lists it as "Philips LED Bulb 9 Watt". Without normalization, these are treated as two different products, making accurate analysis impossible.
Imagine having these 5 different titles from 5 different sellers:
- "Lampu Philips 9W Putih Murah Promo"
- "Philips LED Bulb 9 watt Cool Daylight"
- "Bohlam LED Philips 9W E27 6500K"
- "PHILIPS ESSENTIAL 9W PUTIH"
- "Lampu Hemat Energi Philips 9 Watt Original"
To a human, it's obvious these are all the same "Philips Essential LED Bulb 9W". But to a computer? They are 5 unique strings. Connecting these variations to the single master record is the goal.
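To make the problem concrete, here is a tiny sketch of why naive aggregation fails: grouping by the raw title treats every variation as its own product (titles taken from the list above):

```python
from collections import Counter

raw_titles = [
    "Lampu Philips 9W Putih Murah Promo",
    "Philips LED Bulb 9 watt Cool Daylight",
    "Bohlam LED Philips 9W E27 6500K",
    "PHILIPS ESSENTIAL 9W PUTIH",
    "Lampu Hemat Energi Philips 9 Watt Original",
]

# Naive grouping: one "product" per unique string.
groups = Counter(raw_titles)
print(len(groups))  # 5 groups -> 5 "different" products

# What we actually want: all five resolved to one canonical record.
canonical = {title: "Philips Essential LED Bulb 9 W" for title in raw_titles}
print(len(set(canonical.values())))  # 1
```

Any downstream metric (average price, share of voice) computed over `groups` is wrong by construction; computed over `canonical`, it works.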
2. Idea
If we have a complete taxonomy of the brand's products, we could just match them one by one, right? Seems simple enough, given that many brands have at most a few hundred or thousand unique products. But when thousands of raw data points come in to be mapped, manual matching doesn't scale.
Why can't we just delegate this to an LLM? Again, seems like a pretty simple thing to do, right? And in fact, yes, we can.
Enter CONGA (Commerce Labeling Agent). The idea is to create an agent that can intelligently look up products in our master taxonomy, and if it finds a match, link it. If it doesn't find a match, it should identify it, normalize the name, and "learn" it for next time.
3. Implementation
The Design
The tech stack is straightforward but powerful:
- Framework: Pydantic AI - This simplifies the agentic workflow A LOT.
- LLM Provider: Google Vertex AI (specifically Gemini 2.0 Flash for speed and cost-efficiency); any other provider should work as long as Pydantic AI supports it.
- Search Engine: Meilisearch - Used to index the master taxonomy for lightning-fast fuzzy searching.
- Storage: Any storage or database will do; in my case I use a simple generated taxonomy (SQLite/JSON) to store "new" products the agent finds.
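As a rough illustration of the Meilisearch piece, each taxonomy row becomes one searchable document. The index name, URL, and key below are assumptions, and the client calls (via the official `meilisearch` Python package) are commented out so the snippet stays self-contained:

```python
# Each taxonomy row becomes one Meilisearch document. The fields
# mirror the taxonomy table shown later in the post.
taxonomy = [
    {"id": 1, "brand": "Philips", "category": "LED Bulb",
     "product_name": "Philips RadiantLine LEDBulb 5W 6500K"},
    {"id": 2, "brand": "Philips", "category": "LED Bulb",
     "product_name": "Philips Essential LEDBulb 9W 6500K"},
]

# With a running Meilisearch instance (local defaults assumed):
# import meilisearch
# client = meilisearch.Client("http://localhost:7700", "masterKey")
# client.index("taxonomy").add_documents(taxonomy, primary_key="id")
# client.index("taxonomy").search("philips 9w")  # typo-tolerant fuzzy search

brands = {doc["brand"] for doc in taxonomy}
```

Meilisearch's built-in typo tolerance is what lets a sloppy query like "philps 9w" still land on the right document.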
One of the key components is the create_agent function which initializes the agent with the system prompt and tools:
```python
from typing import Optional, Sequence

from pydantic_ai import Agent
from pydantic_ai.models.google import GoogleModel
from pydantic_ai.toolsets import AbstractToolset

# LabelingResult, AgentDependencies, load_system_prompt,
# build_env_fastmcp_toolsets, and the search tools are project-local.


def create_agent(
    system_prompt: Optional[str] = None,
    extra_toolsets: Optional[Sequence[AbstractToolset[AgentDependencies]]] = None,
    model_name: str = 'gemini-2.0-flash',
    verbose: bool = False,
) -> Agent:
    """Create CONGA agent with custom or default system prompt."""
    prompt = system_prompt or load_system_prompt()

    # Initialize Google model
    model = GoogleModel(model_name)

    toolsets: list[AbstractToolset[AgentDependencies]] = []
    toolsets.extend(build_env_fastmcp_toolsets())
    if extra_toolsets:
        toolsets.extend(extra_toolsets)

    return Agent(
        model,
        output_type=LabelingResult,
        system_prompt=prompt,
        deps_type=AgentDependencies,
        tools=[search_taxonomy, search_generated_taxonomy, get_taxonomy_context],
        toolsets=toolsets,
    )
```
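`create_agent` leans on a couple of project-local types. `AgentDependencies`, for instance, might look roughly like this; only `use_meilisearch` and `meilisearch_tool` are grounded in the tool code shown in this post, the exact shape is otherwise my guess:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AgentDependencies:
    """Dependencies injected into each tool call via RunContext.

    use_meilisearch and meilisearch_tool mirror the fields accessed
    in search_taxonomy; treat the rest of the shape as illustrative.
    """
    use_meilisearch: bool = True
    meilisearch_tool: Optional[Any] = None  # wrapper around the Meilisearch index


deps = AgentDependencies(use_meilisearch=False)
```

Passing dependencies this way (rather than via globals) is what makes the tools easy to test with stubs.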
And here is how we implement the search_taxonomy tool that allows the agent to query our Meilisearch index:
```python
def search_taxonomy(
    ctx: RunContext[AgentDependencies], query: str, category: Optional[str] = None
) -> str:
    """Search for products in the predefined taxonomy."""
    print(f"  🔍 Searching taxonomy for: '{query}'"
          + (f" (category: {category})" if category else ""))

    if ctx.deps.use_meilisearch and ctx.deps.meilisearch_tool:
        # Use Meilisearch
        matches = ctx.deps.meilisearch_tool.search(query, category=category)
    else:
        matches = []  # Fallback to simple search... (omitted for brevity)

    if not matches:
        return "No matches found in taxonomy."

    # Render matches as plain text for the LLM to reason over.
    result = f"Found {len(matches)} matches in taxonomy:\n"
    for match in matches:
        result += f"- Normalized Name: {match.get_normalized_name()}\n"
        result += f"  Brand: {match.brand}\n"
        result += f"  Category: {match.category}\n"
        if match.sku_type_complete:
            result += f"  SKU Type: {match.sku_type_complete}\n"
        result += "\n"
    return result
```
How it Works
The agent is given a specific goal: map the raw product title to the master taxonomy. We give it two main tools:
- search_taxonomy: to query the official Meilisearch index.
- search_generated_taxonomy: to check whether we've already seen this "new" product before.
The flow is:
- Analyze: The agent reads the raw title.
- Search Master: It queries the Meilisearch index.
- Decision:
  - If a high-confidence match (>0.7) is found, use it.
  - If NOT found, check the "generated" storage.
  - If still not found, create a new label and flag it as source="new".
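The cascade above can be sketched as plain Python. This is illustrative only: the real agent makes these decisions through the system prompt and tool calls, and the two search callables stand in for its tools, returning `(normalized_name, confidence)` or `(None, 0.0)`:

```python
# Illustrative sketch of CONGA's decision cascade.

def resolve(raw_title, search_master, search_generated, threshold=0.7):
    name, conf = search_master(raw_title)
    if name and conf > threshold:
        return {"normalized_name": name, "source": "taxonomy", "confidence": conf}

    name, conf = search_generated(raw_title)
    if name:
        return {"normalized_name": name, "source": "generated", "confidence": conf}

    # Nothing matched anywhere: mint a new label from the raw title.
    return {"normalized_name": raw_title, "source": "new", "confidence": 0.4}


# Demo with stub search functions:
master = {"philips essential 9w putih": ("Philips Essential LED Bulb 9 W", 0.95)}
result = resolve(
    "PHILIPS ESSENTIAL 9W PUTIH",
    search_master=lambda t: master.get(t.lower(), (None, 0.0)),
    search_generated=lambda t: (None, 0.0),
)
# result["source"] is "taxonomy"
```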
The Challenge: Prompt Engineering
Since the tech stack is pretty simple, what's the challenge? It's the System Prompt. This turned out to be more of a prompt engineering challenge than a complex ML Ops problem.
The key was telling the LLM exactly how to search. We couldn't just say "find this product." We had to give it a Search Strategy.
In the final system prompt, we explicitly instructed it to try multiple queries in order:
- FIRST TRY: Brand + Product Line + Wattage (e.g., "Philips Essential LED 9W") - Most specific.
- IF NO MATCH: Brand + Wattage only (e.g., "Philips 9W") - Catches variations.
- IF NO MATCH: Brand + Product Type (e.g., "Philips LED Bulb") - Broad search.
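The query ladder can be expressed as a simple loop. In CONGA the LLM composes these queries itself from the prompt instructions; `build_queries` and the stub index below are illustrative:

```python
# Sketch of the prompt's search ladder: try increasingly broad
# queries until one returns matches.

def build_queries(brand, line, wattage, ptype):
    return [
        f"{brand} {line} {wattage}",  # 1. most specific
        f"{brand} {wattage}",         # 2. catches varying line names
        f"{brand} {ptype}",           # 3. broad
        brand,                        # 4. last resort
    ]


def search_with_fallback(search_fn, queries):
    for query in queries:
        matches = search_fn(query)
        if matches:
            return query, matches
    return None, []


# Demo: a stub index that only answers the broad query.
index = {"Philips LED Bulb": ["Philips Essential LED Bulb 9 W"]}
queries = build_queries("Philips", "Essential LED", "9W", "LED Bulb")
used_query, hits = search_with_fallback(lambda q: index.get(q, []), queries)
# used_query is "Philips LED Bulb" (the third, broader query)
```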
We also added Critical Decision Logic:
"If NOT lighting (e.g., smartphone, laptop), skip taxonomy search → source = 'new'"
By forcing the agent to be persistent and systematic, the match rate improved dramatically.
That alone is a pretty good result, but we can do better with a clear understanding of the category's nuances. For lighting products, most titles carry attributes like power, color temperature, and color as keywords. Adding this context to the system prompt helps the agent decide how to search for each product.
4. The Data
To really test this, we need two things: a source of truth (Taxonomy) and a mess of data (Input).
The Taxonomy (Source of Truth)
This is what our "Golden Record" looks like. It's a clean list of products with their attributes.
| Product Name | Brand | Category | Attributes |
|---|---|---|---|
| Philips RadiantLine LEDBulb 5W 6500K | Philips | LED Bulb | 5W, 6500K |
| Philips Essential LEDBulb 9W 6500K | Philips | LED Bulb | 9W, 6500K |
The Input (The Mess)
And this is what we're trying to fix. Real-world raw titles from scraping:
| Raw Input Title |
|---|
| Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips |
| PHILIPS LED BULB 7W RADIANTLINE 3000K WARM WHITE LAMPU HEMAT ENERGI |
| Samsung Galaxy A54 5G Ram 8/256 GB Garansi Resmi SEIN Indonesia |
| Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik |
| iPhone 14 Pro Max 256GB Deep Purple Apple Garansi iBox |
5. Result
The results are surprisingly good. I'm satisfied with them, actually. Modern LLMs are really good at this kind of logic-heavy text processing.
The agent returns a structured LabelingResult like this:
```json
{
  "original_title": "Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips",
  "normalized_name": "Philips RadiantLine LEDBulb 5W 6500K",
  "brand": "Philips",
  "category": "LED Lamps",
  "confidence": 0.95,
  "source": "taxonomy"
}
```
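The `LabelingResult` schema can be sketched like this. It's shown as a stdlib dataclass so the snippet is self-contained; in the actual project it is a Pydantic model, which is what lets Pydantic AI validate the agent's structured output against it:

```python
import json
from dataclasses import dataclass


@dataclass
class LabelingResult:
    original_title: str
    normalized_name: str
    brand: str
    category: str
    confidence: float
    source: str  # "taxonomy" | "generated" | "new"


payload = """{
    "original_title": "Lampu Philips RadiantLine LEDBulb 5W 6500K Putih Lampu LampuLEDPhilips",
    "normalized_name": "Philips RadiantLine LEDBulb 5W 6500K",
    "brand": "Philips",
    "category": "LED Lamps",
    "confidence": 0.95,
    "source": "taxonomy"
}"""

parsed = LabelingResult(**json.loads(payload))
```

Because the output is typed rather than free text, downstream aggregation can trust fields like `source` and `confidence` directly.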
Real-World Test
Here is a snippet from our actual experiment log showing the agent in action. Notice how it finds high-confidence matches in the taxonomy for known products (Hannochs, Avaro) but correctly identifies a new product (Philips LED Strip) when the exact SKU isn't in the master record.
```
[1/100] Processing: Hannochs Lampu Bohlam LED Vario 45W Cahaya Kuning
  🔍 Searching taxonomy for: 'Hannochs LED 45W' (category: LED Lamps)
  ✅ SUCCESS
     Normalized Name: Hannochs LED Vario 45 W
     Confidence: 90.00%
     Source: taxonomy

[2/100] Processing: [Smart ] Avaro Smart Led Light Bulb Rgbww 10W Bluetooth Wireless Iot [Lamp]
  🔍 Searching taxonomy for: 'Avaro Smart Led Light Bulb 10W'
  ✅ SUCCESS
     Normalized Name: Avaro Smart LED Bluetooth 10 W
     Confidence: 90.00%
     Source: taxonomy

[3/100] Processing: TERBARU PHILIPS SMART WIFI LED STRIP STARTER KIT 2M - COLOR & TUNABLE T0306
  🔍 Searching taxonomy for: 'Philips LED strip 2M'
  ✅ SUCCESS
     Normalized Name: Philips Smart Wi-Fi LED Strip 2M
     Confidence: 60.00%
     Source: new
```
Here is a sample of how it handles different scenarios:
| Raw Input Title | Normalized Name | Brand | Source | Confidence |
|---|---|---|---|---|
| Lampu Philips LED 9W Putih Murah | Philips Essential LED Bulb 9 W | Philips | taxonomy | 0.95 |
| Hannochs 10 watt cahaya putih | Hannochs LED Bulb 10 W | Hannochs | generated | 0.82 |
| Smart LED Bulb Bardi 12W RGBWW | Bardi Smart LED Bulb 12W RGBWW | Bardi | new | 0.45 |
| iPhone 15 Pro Max 256GB | iPhone 15 Pro Max 256GB | Apple | new | 0.10 |
And the cost? Not concerning at all with Gemini Flash. It's a pretty simple solution for a pretty simple job, but it solves a real data quality headache.
Here's the system prompt:
You are CONGA (Commerce Labeling Agent), an AI specialized in labeling and normalizing product titles from e-commerce platforms.
IMPORTANT: The taxonomy database is FOCUSED ON LIGHTING PRODUCTS ONLY (LED bulbs, lamps, etc.).
If a product is NOT a lighting product (e.g., smartphones, laptops, etc.), it will NOT be in the taxonomy.
For non-lighting products, you should create a new label (source = "new").
Your task is to:
1. Analyze product title provided
2. Search predefined taxonomy for matches FIRST (only for lighting products)
3. If a good match is found in taxonomy (confidence > 0.7), use it and STOP - do NOT search generated taxonomy
4. Only search generated taxonomy if NO good taxonomy match is found
5. Extract and normalize product information
TAXONOMY DATA STRUCTURE:
When you search the taxonomy, each entry has these fields:
- brand: e.g., "Philips"
- category: e.g., "LED Lamps"
- product_name: e.g., "Philips Essential LED Bulb 9 W" (THIS IS THE NORMALIZED SKU NAME - use this!)
- sub_brand: e.g., "Philips Essential LED Bulb"
Example taxonomy entry:
{
"brand": "Philips",
"category": "LED Lamps",
"product_name": "Philips Essential LED Bulb 9 W",
"sub_brand": "Philips Essential LED Bulb"
}
Note: sku_type_complete is the authoritative normalized product name from the taxonomy.
SEARCH STRATEGY - TRY MULTIPLE APPROACHES:
The taxonomy search is fuzzy and powerful. You MUST try multiple search queries before giving up.
For a product like "Philips Essential LED 9W Daylight 6500K Bohlam Lampu Hemat Listrik":
1. FIRST TRY: Brand + Product Line + Wattage
- Search: "Philips Essential LED 9W"
- This is the most specific and usually works best
2. IF NO MATCH: Brand + Wattage only
- Search: "Philips 9W"
- Catches products where product line name varies
3. IF NO MATCH: Brand + Product Type
- Search: "Philips LED Bulb"
- Broader search to find any similar products
4. IF NO MATCH: Just Brand
- Search: "Philips"
- See what products exist from this brand
IMPORTANT RULES:
- Try AT LEAST 2-3 different search queries before giving up on taxonomy
- DO NOT add category filters to search queries - let Meilisearch handle relevance
- Extract the sku_type_complete EXACTLY as returned from taxonomy
- Only mark as "new" if you've tried multiple searches and found nothing relevant
Guidelines:
- Extract brand names accurately regardless of industry
- Identify correct product category based on product characteristics
- Normalize product names to be clean, consistent, and professional
- Preserve key technical specifications (wattage, model numbers, sizes, etc.)
- For non-lighting products, set source = "new" since they won't be in the lighting taxonomy
CRITICAL DECISION LOGIC:
1. Identify if the product is a lighting product:
- If NOT lighting (e.g., smartphone, laptop), skip taxonomy search → source = "new"
- If lighting product, proceed to search taxonomy
2. Search taxonomy PERSISTENTLY (for lighting products only):
- Try MULTIPLE search queries (see SEARCH STRATEGY above)
- Start specific, then broaden: brand+line+wattage → brand+wattage → brand+type → brand only
- If ANY search finds a good match (confidence >= 0.7):
* Use the taxonomy result
* Set source = "taxonomy"
* Set sku_type_complete = EXACTLY the "sku_type_complete" value from the taxonomy entry
* DO NOT search generated taxonomy
- Only proceed to step 3 if ALL search attempts fail
3. If taxonomy has NO good match after multiple tries:
- Search generated taxonomy
- If found, set source = "generated"
- Set sku_type_complete = None (generated taxonomy doesn't have SKU types)
4. If neither has a match:
- Create a new label
- Set source = "new"
- Set sku_type_complete = None
REMEMBER: For lighting products, you should try AT LEAST 2-3 different taxonomy searches before giving up!
Confidence Scoring Rules:
- High confidence (>0.8) when matching existing taxonomy
- Medium confidence (0.5-0.8) when matching generated taxonomy
- Lower confidence (<0.5) when creating new labels
Source Classification:
- "taxonomy" when using predefined taxonomy (always prefer this if a good match exists)
- "generated" when using generated taxonomy (only if no good taxonomy match)
- "new" when creating new labels (only if neither taxonomy nor generated has a match)
Output Requirements:
Always return a complete LabelingResult with all fields filled.
The source field MUST accurately reflect where the data came from.
Disclaimer: This is not traditional NER (Named Entity Recognition). Traditional NER models extract entities (like "Philips" or "9W") but don't inherently link them to a master record. This approach leverages the reasoning power of LLMs to mimic a human data analyst: searching, comparing, and making a decision to map the messy input to a canonical entry (Entity Resolution).