
ThomasP

Posted on • Originally published at getmio.app

Why finding where a product is made is an AI problem

A barcode tells you where a product was registered. Not where it was made.


Pick up any product at the grocery store. Flip it over. See that barcode? The first three digits tell you which country it was registered in. Starts with 300-379? France. 400-440? Germany. 890? India.

Most people (including me, before I started working on this) assume that's where the product is manufactured.

It's not. Not even close.

A French brand can register barcodes in France and make everything in China. A German company can produce in Poland. That 3-digit GS1 prefix matches the actual manufacturing country about 40% of the time. Basically a coin flip.

The GS1 barcode prefix indicates where the brand is registered, not where the product is manufactured. It matches the actual manufacturing country about 40% of the time.

I'm building Mio, an app that lets you scan a barcode and find where a product is actually made. What I thought would be a fun database project turned into one of the most interesting AI engineering challenges I've worked on. Here's why.

Fair warning: I'm a developer, not a data scientist. Some of what I'll share in this series (variance exists! you need more than 3 test cases!) will make ML engineers smile. But most devs building with LLMs right now don't have a stats background, and I suspect they're running into the same walls I did. If my trial-and-error saves someone a few days, that's enough.

The data exists. Good luck finding it.

Here's the thing that makes this problem so sneaky: the manufacturing origin of most products is available somewhere. It's buried in a retailer's product page. It's in an open database. It's on the packaging in 6pt font. It's implied by a quality label that legally requires a specific region.

But "somewhere" is doing a lot of heavy lifting in that sentence.

It's fragmented. One database has the product name but no origin. A retailer page has "Country of manufacture: Germany" buried in the specs tab nobody clicks. An open food database has a sanitary registration code that implies a packaging location, which may or may not match where it was made.

It's inconsistent. "Made in France." "Fabriqué en France." "Pays de fabrication : FR." "Lieu de production : Normandie." Same information, a dozen different formats across languages and conventions.

It's actively misleading. And this is where it gets really fun. A retailer might display "Fabriqué en France" as a site-wide promotional banner, not a statement about the specific product you're looking at. Amazon might show "Country of Origin: China" for the seller's account, not the product. A brand's website proudly states "French since 1921" while manufacturing in Italy through a parent company nobody's heard of.

The same product's origin information scattered across five different sources, each with a different (and sometimes contradictory) fragment of data.

This is not a database lookup problem. This is a reasoning problem.

The obvious approaches. We tried them all.

"Just build a database." We did. We integrated a product database covering 69 million items. It has names, brands, categories, labels, and for some products, manufacturing origin. When that field is populated, it's rock solid. Problem: it's populated for maybe 15-20% of products. The other 80% give you a name and brand, but no origin.

"Just scrape retailer sites." Tried that too. Some retailers do list manufacturing origin in product specs. But not all products, not all retailers, and the HTML structure varies wildly. A static scraping pipeline breaks every time someone redesigns a product page. Which is constantly.

"Isn't this regulated?" In the EU, manufacturing country isn't mandatory on most product labels. Food is somewhat better covered, but even food has exceptions. And regulatory databases, when they exist, are rarely machine-readable.

Each approach alone tops out at ~20-30% coverage. And none of them can tell you how confident to be. A direct "Made in Germany" statement on the manufacturer's website is a completely different signal than inferring "probably Germany" because the brand is German and the barcode prefix is 400.

Why this is actually a reasoning task

Here's the insight that changed everything for us: finding where a product is manufactured is a multi-step reasoning task with uncertain evidence.

A real example from our benchmark: a toothbrush with a French barcode, sold by a brand founded in France in 1921, now owned by an Italian conglomerate. First web search returns a retailer page showing "Fabriqué en France", but is that about this product, or a promotional banner for the retailer's French-made product line? A second result shows the parent company runs factories in Italy, Poland, and France. An open database has no manufacturing data but lists a sanitary code starting with "IT", suggesting Italian packaging.

To actually figure this out, you need to:

  1. Search across multiple sources, in multiple languages
  2. Actually read the pages, not just search snippets, to verify that "Made in X" refers to this specific product
  3. Cross-reference: does the sanitary code match the web sources? Does the corporate ownership explain the discrepancy?
  4. Calibrate confidence. Is this verified or an educated guess?
  5. Know when to stop. Some products can't be traced from public sources. "Unknown" beats a wrong answer every time.
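Those five steps amount to an adaptive loop, not a fixed pipeline. Here's a minimal sketch of that loop (all names are hypothetical, not Mio's actual code; `tools.decide_next_action` stands in for the LLM call that picks the next move):

```python
def trace_origin(product, tools, max_steps=8):
    """Iteratively gather evidence until the agent can answer or gives up."""
    evidence = []
    for _ in range(max_steps):
        # The LLM looks at what we have so far and picks the next action.
        action = tools.decide_next_action(product, evidence)
        if action.kind == "search":
            evidence += tools.web_search(action.query, lang=action.lang)
        elif action.kind == "read_page":
            # Read the full page: a snippet can't distinguish a
            # product-specific "Made in X" from a site-wide banner.
            evidence.append(tools.read_page(action.url))
        elif action.kind == "answer":
            return action.country, action.confidence, evidence
    # Step budget exhausted: "unknown" beats a wrong answer.
    return None, "low", evidence
```

The key property is that the control flow lives in the model's decisions, not in the code: the same function can do one search for an easy product and eight reads in three languages for a hard one.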

This is textbook AI agent territory. Not a single LLM call. Not RAG. An agent that decides what to do next based on what it's found so far.

In practice, this is what it looks like: you scan a product in a store, and within a few seconds you get the manufacturing country, a confidence level, the reasoning behind it, and links to the sources.

A scan in Mio: manufacturing country, confidence level, reasoning and sources.

The architecture (high level)

The system follows a simple priority chain:

  • Database first: when a structured database has the origin with high confidence, return it instantly. No LLM needed. This handles ~15-20% of queries in milliseconds.
  • Agent for the rest: an LLM agent with access to web search and page reading, tasked with finding and verifying the manufacturing country. It searches, reads pages, cross-references, and submits an answer with a confidence level.
  • Confidence as a first-class output: every result comes with "verified" (explicit source), "probable" (indirect evidence), or "low" (couldn't find much). This distinction matters more than the country itself for user trust.

The agent can dynamically decide: search with different keywords, read a promising page, try a different language, or bail and report low confidence. That adaptive loop is the whole point.

When the database has the answer, no AI needed. For the other 80%, an agent searches, reads, and cross-references public sources before submitting a result with a confidence level.
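The chain itself is almost boring to write down; a sketch under my own assumed names (nothing here is Mio's real API):

```python
def resolve_origin(barcode, db, agent):
    """Database first, agent fallback, confidence always attached."""
    record = db.get(barcode)
    if record and record.get("origin"):
        # ~15-20% of queries end here, in milliseconds, no LLM involved.
        return {"country": record["origin"],
                "confidence": "verified",
                "source": "database"}
    # For the rest, the agent searches, reads and cross-references.
    result = agent.run(barcode)
    return {"country": result.country,
            "confidence": result.confidence,
            "source": "agent"}
```

All the difficulty is hidden inside `agent.run`; the chain's job is just to make sure you never pay for an LLM call when a structured lookup already has the answer.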

There's an important constraint though: this runs in real time. A user scans a product in a store and expects an answer in seconds, not minutes. And every web search, every page read costs money. So the system needs to be accurate, fast, and cheap. A model that gets 5% more answers right but costs 5x more per scan and takes 30 seconds instead of 10 isn't viable for a consumer app. Finding the right balance between accuracy, cost, and latency turned out to be as hard as the accuracy problem itself.

Five traps that will wreck your accuracy

Building this system taught me things no tutorial or documentation covers. Here are the failure modes that cost us the most time:

1. The GS1 prefix trap

The agent sees a barcode starting with 300 (France) and subconsciously anchors on France, even when the evidence points elsewhere. We had to explicitly break this: "The barcode prefix is where the brand is registered. It is NOT evidence of manufacturing origin." Without this, the agent has a strong France bias. Five of the seven false-confidence cases in our first benchmark were the agent saying "France" when it was wrong.

2. The brand ≠ factory trap

Moulinex is a French brand. It manufactures in China, Poland, and France depending on the product line. Our agent confidently said "Made in France" for products manufactured on a different continent, because the brand's Wikipedia page says "French company." "French brand" is not "French product." Obvious in hindsight. Not obvious to an LLM.

3. The retailer badge trap

This was our number one source of false confidence. Some retail websites show origin-related badges ("Made in France," "Produit local") as promotional elements across their entire site. These show up in search snippets right next to the product listing. The agent can't distinguish a product-specific statement from a marketing banner without actually reading the full page.

We had cases where the agent stated "verified: Made in France" based on a badge that applied to a completely different product line on the same retailer site. Brutal.


4. The "EU" trap

Many products say "Made in EU." Technically correct, practically useless. 27 member states. We spent a week trying to handle this at the model level. The model completely ignored our instructions across every prompt version we tried. Sometimes the right answer is to accept the limitation.

5. Packaging ≠ manufacturing

Sanitary registration codes (EMB codes) tell you where a product was packaged, not where it was manufactured. A product made in Spain can be packaged in France and carry a French code. The data looks authoritative, which is exactly what makes it dangerous.
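The safe way to handle this kind of evidence is to hard-code the distinction so the agent can't blur it. An illustrative sketch (the country-prefix mapping is a simplification I'm assuming here, not a full parser for EU health marks):

```python
def packaging_hint(sanitary_code: str) -> dict:
    """Interpret a sanitary/EMB-style code as PACKAGING evidence only."""
    country = {"FR": "France", "IT": "Italy", "ES": "Spain"}.get(
        sanitary_code[:2].upper())
    return {
        "packaging_country": country,     # what the code actually tells you
        "manufacturing_country": None,    # unknown -- the code doesn't say
    }
```

Keeping the `manufacturing_country` field explicitly `None` is the point: the structure itself prevents an authoritative-looking code from being promoted into an origin claim.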

What actually matters (after 108 benchmark runs)

We ran 108 benchmarks over three weeks. Seven models from four providers. Six major prompt versions with dozens of sub-variants. A golden dataset we hand-curated from 21 items to 57, adding harder cases as the easy ones stabilized. Every single run was measured against ground-truth labels, with the prompt version and git SHA recorded on each trace in Langfuse for full reproducibility.

We went from 42% accuracy to 78%. Here's what crystallized:

False confidence is the metric that matters. Not accuracy. A system that says "I don't know" when it doesn't know is infinitely more trustworthy than one that answers everything but is wrong 15% of the time. We call it "false confidence": the agent says "verified" and it's wrong. That's the number we optimize against above all others.
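As a metric, it's one line over the benchmark results (field names below are illustrative, not our actual schema):

```python
def false_confidence_rate(results):
    """Share of 'verified' answers that are wrong -- the number we
    optimize against above raw accuracy."""
    verified = [r for r in results if r["confidence"] == "verified"]
    if not verified:
        return 0.0
    wrong = sum(1 for r in verified if r["answer"] != r["ground_truth"])
    return wrong / len(verified)
```

Note what it deliberately ignores: "probable" and "low" answers don't count against it, because a hedged wrong guess is an honest miss, not a betrayal of user trust.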

The information quality hierarchy is steep. A structured database field is gold. An explicit "Made in X" on the manufacturer's website is silver. A retailer listing with origin in the specs is bronze. A search snippet mentioning a country near a product name is lead, heavy and potentially toxic. We learned this the hard way.
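If you wanted to make that hierarchy operational, it could look like a simple weighting table the agent scores evidence against (the numbers below are my illustrative assumptions, not calibrated values from our system):

```python
# Illustrative evidence weights, ordered by the hierarchy above.
EVIDENCE_WEIGHT = {
    "structured_database_field": 1.0,    # gold
    "manufacturer_site_statement": 0.8,  # silver
    "retailer_spec_listing": 0.6,        # bronze
    "search_snippet_mention": 0.1,       # lead: heavy, potentially toxic
}
```

The exact numbers matter less than the shape: the drop from a spec listing to a bare snippet is a cliff, not a step.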

Optimization is three-dimensional. We iterated on prompts, tooling, and models, and the interactions between the three are what matter. Prompt rules that failed on a smaller model worked perfectly on a smarter one. Parallel tool execution only helped because the model was smart enough to batch calls and the prompt told it to. We doubled search results from 5 to 10 per query and accuracy dropped, not because "more is bad" but because that model couldn't handle the noise. The best results came from finding the right combination across all three axes.

Intellectual honesty is non-negotiable. We're not auditing factories. We're not certifying supply chains. We're aggregating publicly available information, assigning a confidence level, and presenting it transparently. If a brand lies on its website, we'll relay that lie, and the confidence system will reflect how many independent sources confirmed it. Being clear about what the system can and cannot do is both the ethical choice and the one that builds the most trust.

This pattern is everywhere

The reason I'm writing this up is that this problem structure is way more common than people realize:

  • The answer exists somewhere in public sources
  • No single source is reliable on its own
  • The reasoning path depends on what you find at each step
  • Confidence calibration is as important as the answer itself
  • The problem looks trivially solvable until you actually try to automate it

These are the problems where AI agents genuinely earn their keep. Not because any individual step is hard (searching, reading a webpage, comparing two strings) but because orchestrating those steps requires judgment. When to search again, when to read the full page, when to accept the evidence, when to give up.

108 benchmark runs, 7 models, 6 prompt versions, 3 weeks. The journey from "this kind of works" to "this is reliable enough to ship" was far more interesting, and far more counterintuitive, than I expected. Prompt rules that failed on one model worked on another. Changes I'd written off as failures turned into wins in a different context. The biggest gains came from places I didn't expect.

That's what I'm going to cover in the rest of this series.


Next up: Why we built the evaluation framework before writing a single line of prompt. And why "it seems better on a few examples" is the most dangerous sentence in AI engineering.

I'm building Mio, an app that surfaces manufacturing origin from product barcodes. 108 runs, 7 models, a hand-curated golden dataset, and an LLM-as-judge system reviewing the agent's work. If you've built evaluation pipelines for AI agents or dealt with similar multi-source reasoning problems, I'd love to hear about your experience.
