<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siarhei Dudzin</title>
    <description>The latest articles on DEV Community by Siarhei Dudzin (@s_e64f7bbbf1).</description>
    <link>https://dev.to/s_e64f7bbbf1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2702325%2Fbde63c74-dccf-47aa-b17b-50a2657a08cb.jpg</url>
      <title>DEV Community: Siarhei Dudzin</title>
      <link>https://dev.to/s_e64f7bbbf1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/s_e64f7bbbf1"/>
    <language>en</language>
    <item>
      <title>Your LLM Isn't the Problem. Your Pipeline Is.</title>
      <dc:creator>Siarhei Dudzin</dc:creator>
      <pubDate>Wed, 22 Apr 2026 19:56:12 +0000</pubDate>
      <link>https://dev.to/s_e64f7bbbf1/your-llm-isnt-the-problem-your-pipeline-is-16p1</link>
      <guid>https://dev.to/s_e64f7bbbf1/your-llm-isnt-the-problem-your-pipeline-is-16p1</guid>
      <description>&lt;p&gt;I'm building a SaaS product that automatically tags e-commerce products using LLMs. The implementation surfaced an architectural problem that isn't obvious until you're deep in it - and it's the same problem every production system at scale has quietly solved.&lt;/p&gt;

&lt;p&gt;The assumption most engineers start with is reasonable: tagging is a classification problem, LLMs are good at classification, wire up a prompt and move on. The prompt works. The output looks correct. Then you run it across a catalog.&lt;/p&gt;

&lt;p&gt;Product one gets "sci-fi." Product two gets "science fiction." Product three: "SF." Three synonyms, three separate tags - and the taxonomy starts fragmenting from the very first batch.&lt;/p&gt;

&lt;h2&gt;The problem isn't the model. It's what you're not giving it.&lt;/h2&gt;

&lt;p&gt;Each individual LLM call is correct. The model isn't hallucinating - it's doing exactly what you asked. The problem is that each call has no memory of what came before. You've handed the model a product and asked for tags, but the tag vocabulary itself is part of the input, and you're not providing it.&lt;/p&gt;

&lt;p&gt;This is the architectural mistake: treating tagging as a per-product problem when it's a per-catalog problem.&lt;/p&gt;

&lt;p&gt;Product attributes make it worse before they make it better. Title, description, category, weight, materials - these fields have heavy influence on which tags the model generates, and small differences in how they're structured produce divergent outputs even for semantically identical products. A product with a rich description gets different tags than the same product with a sparse one. The model is making the right inference from the data it has; the data just isn't consistent.&lt;/p&gt;

&lt;h2&gt;What production systems actually do&lt;/h2&gt;

&lt;p&gt;The naive single-call approach breaks at scale. Every company that has published how they solved this converged on the same thing: constrain the LLM's output to a controlled vocabulary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://careersatdoordash.com/blog/how-doordash-leverages-llms-for-better-search-retrieval/" rel="noopener noreferrer"&gt;DoorDash&lt;/a&gt; states it explicitly: ANN retrieval surfaces the top 100 taxonomy concepts as candidates, then the LLM selects from those. Hallucination rate under 1%. Free-form generation doesn't appear in the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://shopify.engineering/evolution-product-classification" rel="noopener noreferrer"&gt;Shopify&lt;/a&gt; runs over 30 million predictions daily across 10,000+ categories. Two sequential calls to a vision-language model - the first predicts category, the second predicts category-specific attributes using the first call's output as context. Training data comes from a multi-LLM annotation system where several models independently evaluate each product and disagreements go to an arbitration layer. With all of that, they land at 85% merchant acceptance. That's what good looks like at this scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aclanthology.org/2024.customnlp4u-1.22/" rel="noopener noreferrer"&gt;Amazon's dual-expert system&lt;/a&gt; takes a different angle. A fine-tuned domain model narrows the label space to a small set of candidates. Then a general LLM reasons over the differences between those candidates and picks the best match. The general LLM never classifies cold - it always operates against the candidate set from the first model. The combination outperforms the domain model alone.&lt;/p&gt;

&lt;h2&gt;The vocabulary is an input, not an output&lt;/h2&gt;

&lt;p&gt;Once you treat the tag vocabulary as a first-class input to every LLM call, the consistency problem largely goes away.&lt;/p&gt;

&lt;p&gt;For smaller catalogs the approach is direct - inject the existing tag list into the prompt with an explicit instruction to prefer existing tags and only introduce a new one when nothing fits. Structured output handles the rest. Spring AI makes this straightforward: define the output schema, and the model returns something parseable rather than something creative. Set temperature near zero. You want deterministic selection, not variation.&lt;/p&gt;
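&lt;p&gt;A minimal sketch of the injection step - in Python rather than Spring AI, since the idea is framework-neutral. The prompt wording, tag list, and JSON shape here are illustrative assumptions, not a prescribed schema:&lt;/p&gt;

```python
def build_tagging_prompt(product_description, existing_tags):
    """Inject the current vocabulary into the prompt and demand JSON output.

    The instruction to prefer existing tags is what keeps the taxonomy
    from fragmenting; the fixed JSON shape keeps the response parseable
    (the same role a structured-output converter plays in Spring AI).
    """
    vocab = ", ".join(sorted(existing_tags))
    return (
        "You are tagging e-commerce products.\n"
        f"Existing tags: {vocab}\n"
        "Prefer existing tags. Introduce a new tag only if nothing fits.\n"
        'Respond as JSON: {"tags": ["..."]}\n\n'
        f"Product: {product_description}"
    )

prompt = build_tagging_prompt(
    "Paperback novel about a generation ship", ["science fiction", "fantasy"]
)
```

&lt;p&gt;Sorting the vocabulary before injection is deliberate: a stable prompt for the same vocabulary keeps outputs reproducible at temperature zero and makes caching possible.&lt;/p&gt;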

&lt;p&gt;For larger vocabularies, pre-filtering the injected list via embedding similarity against the product description keeps the prompt manageable without losing coverage.&lt;/p&gt;
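&lt;p&gt;The pre-filter is a plain top-k similarity search. A self-contained sketch with toy two-dimensional vectors - in production the embeddings would come from an embedding model and likely a vector index rather than a linear scan:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def prefilter_tags(product_vec, tag_vecs, k):
    """Return the k tags whose embeddings are most similar to the product.

    tag_vecs maps tag name to its embedding. Only this shortlist gets
    injected into the prompt, keeping it small for large vocabularies.
    """
    ranked = sorted(tag_vecs, key=lambda t: cosine(product_vec, tag_vecs[t]), reverse=True)
    return ranked[:k]
```

&lt;p&gt;With k in the low hundreds this mirrors the DoorDash setup: the prompt stays bounded no matter how large the full vocabulary grows, and the model still sees every plausible candidate.&lt;/p&gt;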

&lt;p&gt;The model's job is no longer to invent a taxonomy. It's to place a product within one that already exists and extend it conservatively when needed. That's a fundamentally different - and easier - task.&lt;/p&gt;

&lt;h2&gt;The cold-start problem nobody names&lt;/h2&gt;

&lt;p&gt;Here's the part that doesn't appear in most posts on this topic: every new store starts with zero tags. There's no vocabulary to inject and no taxonomy to constrain against.&lt;/p&gt;

&lt;p&gt;This isn't a minor edge case - it's the state every new customer is in when they connect. And it's where the naive approach fully breaks down, because the fix for consistency (vocabulary injection) depends on having a vocabulary.&lt;/p&gt;

&lt;p&gt;The approach that makes sense architecturally: seed first, bulk-classify second. Take a representative sample - enough to cover the catalog's main categories - run it through the pipeline without constraints, review and normalize the resulting vocabulary, then use that curated list as the constraint for bulk classification of the rest. You're building the taxonomy before you rely on it.&lt;/p&gt;
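&lt;p&gt;The two-phase flow can be sketched as a skeleton where the LLM calls and the human review step are injected as callables - all of the names below are stand-ins, not a real implementation:&lt;/p&gt;

```python
def seed_then_classify(catalog, sample_size, tag_free, tag_constrained, review):
    """Two-phase cold-start pipeline sketch (all callables are stand-ins).

    tag_free(product)               - unconstrained LLM tagging, seed phase only
    review(raw_vocab)               - human step: merge synonyms, prune noise
    tag_constrained(product, vocab) - vocabulary-constrained tagging for the rest
    """
    sample, rest = catalog[:sample_size], catalog[sample_size:]

    # Phase 1: build a raw vocabulary from a representative sample.
    raw_vocab = set()
    for product in sample:
        raw_vocab.update(tag_free(product))

    # Human-in-the-loop once, on the vocabulary - not on every product.
    vocab = review(raw_vocab)

    # Phase 2: bulk-classify the remainder against the curated vocabulary.
    return {p: tag_constrained(p, vocab) for p in rest}, vocab
```

&lt;p&gt;In practice the sample should be stratified by category rather than taken from the front of the catalog as this sketch does - the point is the shape: unconstrained tagging is quarantined to the seed phase, and everything after runs constrained.&lt;/p&gt;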

&lt;p&gt;The human effort sits at the seeding stage, not spread across the full catalog. You review a vocabulary, not individual products.&lt;/p&gt;

&lt;h2&gt;In short&lt;/h2&gt;

&lt;p&gt;Product tagging looks like a prompt engineering problem. It isn't. The consistency problem is architectural. The fix is treating your tag vocabulary as a first-class input to every LLM call - and building a pipeline that maintains and enforces that vocabulary across a catalog, not just across a single product call.&lt;/p&gt;

&lt;p&gt;The model is rarely the bottleneck. The pipeline around it usually is.&lt;/p&gt;

&lt;p&gt;How are you handling tag consistency in your own systems? Vocabulary injection into the prompt, embedding pre-filtering, a fixed taxonomy - curious what's actually working in production.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>ecommerce</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
