<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CatMap</title>
    <description>The latest articles on DEV Community by CatMap (@catmap).</description>
    <link>https://dev.to/catmap</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3646267%2F301d08d2-8551-4c4e-aef2-4ab206b382eb.png</url>
      <title>DEV Community: CatMap</title>
      <link>https://dev.to/catmap</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/catmap"/>
    <language>en</language>
    <item>
      <title>Why Regex Fails at Google Taxonomy: Building a 98% Accurate RAG Agent</title>
      <dc:creator>CatMap</dc:creator>
      <pubDate>Mon, 15 Dec 2025 06:42:49 +0000</pubDate>
      <link>https://dev.to/catmap/why-regex-fails-at-google-taxonomy-building-a-98-accurate-rag-agent-57d9</link>
      <guid>https://dev.to/catmap/why-regex-fails-at-google-taxonomy-building-a-98-accurate-rag-agent-57d9</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: "Is a 'Hot Dog' a Dog?" 🌭
&lt;/h2&gt;

&lt;p&gt;In Google Merchant Center, categorization is everything. If you misclassify a product, your ads stop running.&lt;/p&gt;

&lt;p&gt;Most feed tools use &lt;strong&gt;keyword matching&lt;/strong&gt; (Regex).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Rule: &lt;code&gt;If title contains "Dog" -&amp;gt; Category: Animals &amp;gt; Pets &amp;gt; Dogs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Input: &lt;code&gt;"Hot Dog Costume"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Result: &lt;code&gt;Animals &amp;gt; Pets &amp;gt; Dogs&lt;/code&gt; ❌ (Wrong!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why 15-20% of products in large catalogs often sit in "Disapproved" purgatory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Retrieval-Augmented Generation (RAG) 🧠
&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;CatMap AI&lt;/strong&gt; to solve this using Vectors, not Keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Architecture
&lt;/h2&gt;

&lt;p&gt;Instead of rules, we convert the entire &lt;strong&gt;Google Product Taxonomy (5,500+ nodes)&lt;/strong&gt; into a Vector Index using OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When a product comes in (&lt;code&gt;"Pallash Casual Women's Kurti"&lt;/code&gt;), we don't look for the word "Kurti". We look for the &lt;strong&gt;mathematical concept&lt;/strong&gt; of the product in vector space.&lt;/p&gt;
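&lt;p&gt;&lt;em&gt;For illustration, here is a minimal sketch of that lookup. The names (&lt;code&gt;cosine&lt;/code&gt;, &lt;code&gt;nearestCategory&lt;/code&gt;) and the in-memory index are my simplifications, not CatMap's actual implementation; the embedding call itself is left abstract.&lt;/em&gt;&lt;/p&gt;

```javascript
// Sketch (hypothetical names): search a precomputed vector index of
// taxonomy paths. Each entry is { path, vector }, where `vector` came
// from an embedding model such as text-embedding-3-small.

// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Return the taxonomy path whose vector is closest to the query vector.
function nearestCategory(index, queryVector) {
  let best = null;
  for (const node of index) {
    const score = cosine(node.vector, queryVector);
    if (best === null || score > best.score) {
      best = { path: node.path, score };
    }
  }
  return best;
}
```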

&lt;h2&gt;
  
  
  2. The "Smart Retry" Pattern 🔄
&lt;/h2&gt;

&lt;p&gt;Here is where it gets interesting: standard vector search often fails on culture-specific terms.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Input: &lt;code&gt;Kurti&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  Vector Match: &lt;code&gt;Generic Clothing&lt;/code&gt; (Confidence: Low)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To fix this, we implemented an &lt;strong&gt;Agentic Loop&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Attempt 1:&lt;/strong&gt; Standard Search. Result: &lt;code&gt;Uncategorized&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Trigger:&lt;/strong&gt; Agent detects failure.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Action:&lt;/strong&gt; Agent calls an LLM (&lt;code&gt;gpt-5-nano&lt;/code&gt;) to "expand" the query.

&lt;ul&gt;
&lt;li&gt;  &lt;em&gt;Prompt:&lt;/em&gt; "What is a Kurti? Give me synonyms."&lt;/li&gt;
&lt;li&gt;  &lt;em&gt;Response:&lt;/em&gt; "Tunic, Blouse, Shirt".&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Attempt 2:&lt;/strong&gt; Vector Search with &lt;code&gt;"Tunic Blouse Shirt"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Result:&lt;/strong&gt; &lt;code&gt;Apparel &amp;gt; Clothing &amp;gt; Shirts &amp;amp; Tops&lt;/code&gt;. ✅&lt;/li&gt;
&lt;/ol&gt;
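&lt;p&gt;&lt;em&gt;A minimal sketch of step 3, the expansion call. The prompt wording and the &lt;code&gt;llm&lt;/code&gt; wrapper are illustrative assumptions; any chat-completion client can fill that role.&lt;/em&gt;&lt;/p&gt;

```javascript
// Hypothetical sketch: turn an unrecognized term into a synonym query.
// `llm` is any async function that takes a prompt string and returns text.
async function expandQuery(title, llm) {
  const prompt = `What is a "${title}"? Reply with three generic product synonyms, comma-separated.`;
  const answer = await llm(prompt);
  // "Tunic, Blouse, Shirt" becomes the new search string "Tunic Blouse Shirt"
  return answer.split(",").map((s) => s.trim()).join(" ");
}
```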

&lt;h2&gt;
  
  
  3. The Stress Test 📉
&lt;/h2&gt;

&lt;p&gt;We ran this system against &lt;strong&gt;2,000 real-world edge cases&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Coverage:&lt;/strong&gt; 100% (Up from 85%).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Accuracy:&lt;/strong&gt; 98.3%.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Time per Row:&lt;/strong&gt; ~200ms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Snippet (The Retry Logic)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified Logic&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Uncategorized&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;synonyms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;expandQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// AI Call&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;VectorStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;synonyms&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;categorizeWithContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;newContext&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Regex is dead for categorization. Context-aware AI is the only way to handle the complexity of modern e-commerce catalogs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to test the API, I'm opening a Free Beta for developers. &lt;a href="https://catmap.dev/" rel="noopener noreferrer"&gt;Link to CatMap AI&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me for more Engineering Deep Dives into AI Agents.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>ai</category>
      <category>vectorsearch</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>How I processed 2,000 concurrent OpenAI requests using Node.js Streams (Zero 429 Errors)</title>
      <dc:creator>CatMap</dc:creator>
      <pubDate>Thu, 04 Dec 2025 13:19:26 +0000</pubDate>
      <link>https://dev.to/catmap/how-i-processed-2000-concurrent-openai-requests-using-nodejs-streams-zero-429-errors-341a</link>
      <guid>https://dev.to/catmap/how-i-processed-2000-concurrent-openai-requests-using-nodejs-streams-zero-429-errors-341a</guid>
      <description>&lt;p&gt;I recently built a backend engine to solve a boring but massive problem in e-commerce: &lt;strong&gt;Taxonomy Mapping.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch the demo test:&lt;/strong&gt;&lt;br&gt;


  &lt;iframe src="https://www.youtube.com/embed/ygqJHb4Z5Xc"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;p&gt;The goal was simple: Take a messy CSV of 20,000 products and map them to the official Google Taxonomy IDs using an LLM.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Rate Limits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you try to &lt;code&gt;Promise.all()&lt;/code&gt; 2,000 requests to OpenAI, three things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory Spike:&lt;/strong&gt; Loading a 15MB+ CSV into a variable kills the Node process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;429 Errors:&lt;/strong&gt; OpenAI starts rejecting requests (HTTP 429) the moment you blow past the Requests Per Minute (RPM) limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Collapse:&lt;/strong&gt; &lt;code&gt;Promise.all&lt;/code&gt; fails fast if one request fails, ruining the whole batch.&lt;/li&gt;
&lt;/ol&gt;
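&lt;p&gt;&lt;em&gt;Point 3 is easy to demonstrate in isolation. This toy example is mine, not code from the engine:&lt;/em&gt;&lt;/p&gt;

```javascript
// Promise.all rejects as soon as one job fails, discarding every other result.
async function demo() {
  const jobs = [Promise.resolve("ok-1"), Promise.reject(new Error("429")), Promise.resolve("ok-2")];
  try {
    await Promise.all(jobs);
    return "all good";
  } catch (e) {
    return `batch lost: ${e.message}`;
  }
}

// Promise.allSettled keeps the successes instead of failing fast.
async function demoSettled() {
  const jobs = [Promise.resolve("ok-1"), Promise.reject(new Error("429")), Promise.resolve("ok-2")];
  const results = await Promise.allSettled(jobs);
  return results.filter((r) => r.status === "fulfilled").map((r) => r.value);
}
```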

&lt;p&gt;Here is the architecture I built to process &lt;strong&gt;450+ requests per minute&lt;/strong&gt; reliably using Node.js Streams and Bottleneck.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Memory Problem (Streams vs. Arrays)
&lt;/h2&gt;

&lt;p&gt;Loading a large CSV into memory is a rookie mistake. I switched to &lt;code&gt;fs.createReadStream&lt;/code&gt; combined with &lt;code&gt;csv-parser&lt;/code&gt;. This allows us to pipe the data row-by-row, keeping memory usage almost flat regardless of file size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;javascript
const fs = require('fs');
const csv = require('csv-parser');

const stream = fs.createReadStream(inputFilePath)
  .pipe(csv())
  .on("data", (row) =&amp;gt; {
     // Push job to the limiter (see next section)
     // RAM usage stays constant even with 500MB files
     limiter.schedule(() =&amp;gt; processRow(row));
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. The Rate Limit Problem (Bottleneck)
&lt;/h2&gt;

&lt;p&gt;This was the hardest part. OpenAI's Tier 1 limits are strict (both Requests Per Minute and Requests Per Day). I needed a queue that is aware of time, not just of job counts.&lt;/p&gt;

&lt;p&gt;I used the &lt;code&gt;bottleneck&lt;/code&gt; library to enforce a strict speed limit while still allowing controlled concurrency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Target Speed:&lt;/strong&gt; ~450 RPM (Requests Per Minute) to stay safe.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Calculation:&lt;/strong&gt; 60,000ms / 450 ≈ 133ms delay.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Concurrency:&lt;/strong&gt; We allow 10 concurrent requests so we don't lose time waiting for network latency.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;javascript
const Bottleneck = require("bottleneck");

// Configure the limiter
const limiter = new Bottleneck({
  minTime: 133, // Wait 133ms between launching requests
  maxConcurrent: 10 // Allow 10 active connections to handle latency
});

// Wrap the AI call
const task = limiter.schedule(async () =&amp;gt; {
   return await callOpenAI(row);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Handling "Fatal" vs "Minor" Errors
&lt;/h2&gt;

&lt;p&gt;When processing thousands of rows, you don't want to stop if &lt;em&gt;one&lt;/em&gt; row fails (e.g., bad encoding). But you &lt;em&gt;do&lt;/em&gt; want to stop if you run out of API Credits or hit a hard daily limit.&lt;/p&gt;

&lt;p&gt;We implemented a custom error handling logic where the agent throws specific &lt;code&gt;FATAL_&lt;/code&gt; error codes, which the queue listener catches to &lt;code&gt;stream.destroy()&lt;/code&gt; immediately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;javascript
// Simplified Logic
limiter.schedule(async () =&amp;gt; {
  try {
     return await agent(row);
  } catch (e) {
     if (e.message.startsWith("FATAL_")) {
        // Kill the queue immediately so we don't waste retries
        limiter.stop({ dropWaitingJobs: true });
        stream.destroy();
        console.error("🛑 Queue Killed: " + e.message);
     } else {
        // Minor error (bad row, transient failure): log it and keep going
        console.warn("Row skipped: " + e.message);
     }
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Context-Aware Prompting
&lt;/h2&gt;

&lt;p&gt;Even with the architecture fixed, LLMs have a habit of hallucinating IDs. If a product description says "100% Cotton," the model might return &lt;code&gt;100&lt;/code&gt; as the ID.&lt;/p&gt;

&lt;p&gt;We solved this using &lt;strong&gt;Negative Constraints&lt;/strong&gt; and &lt;strong&gt;Few-Shot Prompting&lt;/strong&gt; to force strict integer validation against the 2024 Taxonomy standard.&lt;/p&gt;
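&lt;p&gt;&lt;em&gt;A sketch of what that can look like. The exact prompt wording and helper names are my assumptions; the two load-bearing ideas are a negative constraint plus a few-shot example, and hard validation of the reply against the candidate IDs.&lt;/em&gt;&lt;/p&gt;

```javascript
// Hypothetical prompt shape: negative constraint + few-shot example.
function buildMessages(productTitle, candidateRows) {
  return [
    { role: "system",
      content: "You map products to Google Taxonomy IDs. " +
        "Return ONLY a numeric category ID from the candidate list. " +
        "Never return numbers that appear in the product text itself." },
    // Few-shot example pinning the expected output format
    { role: "user", content: "Product: Leather Dog Collar\nCandidates: 5092 Animals > Pet Supplies > Dog Supplies" },
    { role: "assistant", content: "5092" },
    { role: "user", content: `Product: ${productTitle}\nCandidates: ${candidateRows}` },
  ];
}

// Strict integer validation: reject anything not in the candidate ID set,
// so a hallucinated "100" from "100% Cotton" never reaches the feed.
function validateId(reply, validIds) {
  const id = Number.parseInt(reply.trim(), 10);
  return validIds.has(id) ? id : null;
}
```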

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;We ran a stress test yesterday against a raw dataset of unorganized products:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Input:&lt;/strong&gt; 2,000 Unorganized SKUs (15MB CSV).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Throughput:&lt;/strong&gt; ~450 RPM (Requests Per Minute).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Errors:&lt;/strong&gt; 0 Rate Limit Errors (429s).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Time:&lt;/strong&gt; ~4.5 Minutes total.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Accuracy:&lt;/strong&gt; 100% Valid Integer IDs (No text hallucinations).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccmpatee4z2701li471y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fccmpatee4z2701li471y.png" alt="Result of stress test of 2000 rows"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By combining Node.js Streams for memory management and Bottleneck for flow control, we turned a script that crashed at 500 rows into an engine that handles 50k rows effortlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 We just launched on Product Hunt!
&lt;/h2&gt;

&lt;p&gt;I wrapped this engine into an API called &lt;strong&gt;CatMap&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s live on Product Hunt today. If you want to test the speed yourself (or try to break it with a messy CSV), we just opened the &lt;strong&gt;Public Demo Key&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check it out here (and I'd love your support!):&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.producthunt.com/products/catmap" rel="noopener noreferrer"&gt;CatMap API on Product Hunt 🚀&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me know in the comments if you have questions about the Node.js implementation or the prompting strategy!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>openai</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
