
Kevin Naidoo

Solve a real-world problem with AI

I use AI for a lot of stuff, from coding assistants like Claude Code to complex voice AI agents.

One thing is very clear: the future is a mixture of small and large models, not the all-or-nothing hype you find on YouTube and other social media sites. In this article, let's look at a real-world problem and how you can use a small model to solve it.

Problem

I need to match my local categories with Google's taxonomy so that I can safely import e-commerce products into my system.

We have a large table of 5,000+ local categories, and merchant products come in categorized using Google Shopping taxonomies.

Now, obviously, the system is a legacy product built around our own categories. It's not feasible to simply dump the current category trees and replace them with Google's.

I still need to import those products into the system, though; a merchant could have thousands of products, and the total product count collectively ends up in the millions.

It's not possible for humans to check and verify each one. Furthermore, asking a big model like Claude or Gemini 2.5 Pro is going to get crazy expensive if I have to categorize every single product.


Solution

Step 1: Vector embedding

Take all 5,000 local categories and index them into a vector store using text-embedding-3-large, then build a thin microservice using FastAPI or whatever backend you like.

The service would have two endpoints (there's a sketch after this list):

  1. api/category/vectorize ~ takes a category name, calculates the vector embedding, and stores it in a vector database like Qdrant or Postgres pgvector.

  2. api/category/search ~ takes a search term, then does a cosine similarity search for the closest matching category with a score of at least 80%.
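
Here's a minimal sketch of what that service could look like, assuming Qdrant as the vector store and OpenAI's official Python client; the collection name, port, and wiring are illustrative, not a definitive implementation:

```python
import uuid

from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

app = FastAPI()
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

COLLECTION = "local_categories"  # illustrative name

# text-embedding-3-large produces 3072-dimensional vectors
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
    )


def embed(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-large", input=text
    )
    return response.data[0].embedding


class Category(BaseModel):
    name: str


@app.post("/api/category/vectorize")
def vectorize(category: Category):
    # Embed the category name and store it, keeping the name as payload
    # so searches can return it.
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=embed(category.name),
                payload={"name": category.name},
            )
        ],
    )
    return {"status": "ok"}


@app.get("/api/category/search")
def search(term: str):
    # Cosine similarity search; only accept a match scoring at least 0.8.
    hits = qdrant.query_points(
        collection_name=COLLECTION,
        query=embed(term),
        limit=1,
        score_threshold=0.8,
    ).points
    if not hits:
        return {"match": None}
    return {"match": hits[0].payload["name"], "score": hits[0].score}
```

Indexing the 5,000 categories is then just a loop over the vectorize endpoint.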

Then you simply need some sort of console job or ETL process that makes an API request whenever a product comes in and gets back a suggested local category.

You shouldn't update the product immediately; instead, store the results in a queue table.
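
Something along these lines, assuming Postgres and the service sketched above; the table and column names are hypothetical:

```python
import psycopg2
import requests

conn = psycopg2.connect("dbname=shop user=etl")  # hypothetical connection


def queue_suggestion(product_id: int, title: str):
    # Ask the microservice for the closest local category.
    resp = requests.get(
        "http://localhost:8000/api/category/search", params={"term": title}
    ).json()
    # Store the suggestion for later verification instead of updating
    # the product immediately.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO category_queue (product_id, suggested_category, score)
            VALUES (%s, %s, %s)
            """,
            (product_id, resp.get("match"), resp.get("score")),
        )
```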

Step 2: Verification

Use a cheaper model like GPT-4o mini, or an Ollama model like qwen2.5:7b, to verify that the category matched via the vector embeddings is accurate.

A simple prompt:

Task: Verify if the product category match is correct.

Product: [PRODUCT_TITLE] - [PRODUCT_DESCRIPTION]
Matched Category: [LOCAL_CATEGORY]
Google Shopping Category: [GOOGLE_SHOPPING_CATEGORY]

Does the matched category accurately represent this product?

Respond with only: <agree> or <disagree>
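
Wiring that prompt up only takes a few lines. Here's a minimal sketch, assuming GPT-4o mini through the OpenAI client; for qwen2.5:7b you'd swap in an Ollama chat call instead:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = """Task: Verify if the product category match is correct.

Product: {title} - {description}
Matched Category: {local_category}
Google Shopping Category: {google_category}

Does the matched category accurately represent this product?

Respond with only: <agree> or <disagree>"""


def verify(title: str, description: str, local_category: str,
           google_category: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": PROMPT.format(
                title=title,
                description=description,
                local_category=local_category,
                google_category=google_category,
            ),
        }],
        temperature=0,  # we want a deterministic agree/disagree answer
    )
    return "<agree>" in response.choices[0].message.content.lower()
```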

Step 3: Build a human-verified database

In this step, you want humans to manually audit samples and store the results in a verified database. This will obviously take some time, but once you get to 10k or so records, it should be okay.

It's important to sample across various categories and different product types.

Build a cleaned dataset from this data by vectorizing the product titles and storing them, similar to step one.
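
A short sketch of that indexing step, assuming a second Qdrant collection dedicated to the human-verified matches (all names here are illustrative):

```python
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

VERIFIED = "verified_products"  # illustrative name
if not qdrant.collection_exists(VERIFIED):
    qdrant.create_collection(
        collection_name=VERIFIED,
        vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
    )


def index_verified(product_title: str, category: str):
    # Embed the verified product title and store it alongside the
    # human-audited category.
    vector = openai_client.embeddings.create(
        model="text-embedding-3-large", input=product_title
    ).data[0].embedding
    qdrant.upsert(
        collection_name=VERIFIED,
        points=[
            PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload={"title": product_title, "category": category},
            )
        ],
    )
```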

Step 4: Audit results

Now add a step before step 1: first, check the verified database to see if the current product title matches something in that database with at least 80% similarity.

If it does, you take that product's category.

You should still run through step 2 as well; however, pass the results from the product search into the prompt as few-shot examples to better help the model determine accuracy.
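
Here's a sketch of that pre-check, assuming the verified_products collection from step 3; the top hit supplies the category, and the hits double as few-shot examples for the step 2 prompt:

```python
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")


def precheck(product_title: str):
    vector = openai_client.embeddings.create(
        model="text-embedding-3-large", input=product_title
    ).data[0].embedding
    hits = qdrant.query_points(
        collection_name="verified_products",
        query=vector,
        limit=3,
        score_threshold=0.8,
    ).points
    if not hits:
        return None, []  # no verified match; fall through to step 1
    # The best hit supplies the category; every hit doubles as a
    # few-shot example for the step 2 verification prompt.
    category = hits[0].payload["category"]
    examples = [
        f'Product: {h.payload["title"]} -> Category: {h.payload["category"]}'
        for h in hits
    ]
    return category, examples
```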
