Why Agentic Loops Might Not Be For You

Siarhei Dudzin — Sun, 14 Jun 2026 12:02:36 +0000

Everyone is talking about agentic loops right now. Far fewer people can tell you what one actually is.

The concept is simple. You hand the agent a goal and it keeps working, turn after turn, deciding for itself when the job is done. Claude Code's /goal is a clean example: it iterates, and a separate model checks each turn against the goal until it passes. People also point to /loop and "routines", but /loop is closer to scheduled automation, re-running a prompt on a timer. Useful, just not the same animal as an agent reasoning its way to a finish line.
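To make the shape concrete, here is a minimal sketch of a goal loop with a separate checker. It is not how Claude Code implements /goal; the names run_goal_loop, act, check, and max_turns are hypothetical, and the toy functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    output: str
    turns: int
    passed: bool


def run_goal_loop(
    goal: str,
    act: Callable[[str, str], str],
    check: Callable[[str, str], bool],
    max_turns: int = 10,
) -> LoopResult:
    """Iterate until the checker accepts the output or the turn budget runs out."""
    output = ""
    for turn in range(1, max_turns + 1):
        output = act(goal, output)  # the agent produces its next attempt
        if check(goal, output):     # a separate judge decides whether the goal is met
            return LoopResult(output, turn, True)
    return LoopResult(output, max_turns, False)


# Toy stand-ins (assumptions): the "agent" adds one step per turn,
# and the "checker" is satisfied once three steps exist.
def toy_act(goal: str, previous: str) -> str:
    return previous + "step;"


def toy_check(goal: str, output: str) -> bool:
    return output.count("step;") >= 3


result = run_goal_loop("write three steps", toy_act, toy_check)
print(result)
```

The important design choice is that the stop condition lives in check, not in the agent. A scheduled re-run like /loop has no equivalent of that function.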

I've been running these loops on real work, and when the conditions are right the results are impressive. The catch is that for most teams, most of the time, the conditions are not right.

So when is a loop actually worth it?

Experimentation and prototyping. When you're exploring and the details don't matter yet, letting the agent fill the gaps is the whole point. You want speed and direction. If it picks a slightly wrong library or names things oddly, who cares, you're throwing most of it away anyway.

When your token budget is effectively unlimited. Loops are expensive by design. The agent re-runs, re-reads, and burns credits on every turn it isn't sure about. A task a senior dev finishes in 20 minutes can cost a small fortune in tokens because the loop keeps second-guessing itself. If that doesn't scare you, fine. For most teams it should.
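A rough way to see why the bill grows faster than the turn count: assume every turn re-reads the whole context, and the context grows by a fixed amount per turn. The function below is a back-of-the-envelope sketch under that assumption; the numbers are made up for illustration and are not measurements from any real tool.

```python
def total_input_tokens(turns: int, base_context: int, growth_per_turn: int) -> int:
    """Sum input tokens when turn i re-reads base_context + i * growth_per_turn."""
    return sum(base_context + i * growth_per_turn for i in range(turns))


# Hypothetical figures: 20k tokens of starting context, 5k added per turn.
for turns in (5, 10, 20):
    print(turns, total_input_tokens(turns, base_context=20_000, growth_per_turn=5_000))
```

Under these assumptions, doubling the number of turns more than doubles the input tokens, because the growth term scales with the square of the turn count.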

Maintenance work behind solid guardrails. This is the one people oversell. The pitch goes: you have a test suite and design docs, so the loop can run safely. Sometimes true. But a green test suite tells you the tests pass. It doesn't tell you the behavior is correct. If your guardrails don't actually encode intent, the loop will optimize for the wrong target and report success with a straight face. A loop is only as good as the fence you build around it.

This is where a good harness matters. The loop needs a way to check its own work, and the stronger that self-verification, the better the loop runs. For anything with a UI, the harness has to include visual checks. A loop that can see what it rendered catches whole classes of mistakes a green test run sails right past.

Look at those three again: an unlimited budget, work you'll throw away, and guardrails that truly capture intent. That's a narrow slice of real work. If none of the three applies to your task, and for most real work none does, the loop is probably not for you.

Skip the loop when details matter. If the exact behavior, the edge cases, or the API shape are load-bearing, an agent guessing is exactly the failure mode you're trying to avoid. Keep a human in the loop and specify those parts yourself.

Skip it when the budget is real. Most teams have a cap. A loop that costs 10x a human for a predictable task is a bad trade, no matter how autonomous it feels.

And skip it on the critical path. For the parts of the system where a wrong decision is expensive, the human is the point. Take the human out and you've removed the one thing keeping the loop honest.

In short: agentic loops shine when being wrong is cheap and being fast is valuable. The moment correctness, cost, or risk enters the picture, the loop stops being a shortcut and turns into a liability. For most teams, most of the time, that's the situation you're in. So loops might not be for you, at least not yet.

Your LLM Isn't the Problem. Your Pipeline Is.

Siarhei Dudzin — Wed, 22 Apr 2026 19:56:12 +0000

I'm building a SaaS product that automatically tags e-commerce products using LLMs. The implementation surfaced an architectural problem that isn't obvious until you're deep in it - and it's the same problem every production system at scale has quietly solved.

The assumption most engineers start with is reasonable: tagging is a classification problem, LLMs are good at classification, wire up a prompt and move on. The prompt works. The output looks correct. Then you run it across a catalog.

Product one gets "sci-fi." Product two gets "science fiction." Product three: "SF." Three synonyms, three separate tags, and the taxonomy is fragmenting from the first batch.

The problem isn't the model. It's what you're not giving it.

Each individual LLM call is correct. The model isn't hallucinating - it's doing exactly what you asked. The problem is that each call has no memory of what came before. You've handed the model a product and asked for tags, but the tag vocabulary itself is part of the input, and you're not providing it.

This is the architectural mistake: treating tagging as a per-product problem when it's a per-catalog problem.

Product attributes make it worse before they make it better. Title, description, category, weight, materials - these fields have heavy influence on which tags the model generates, and small differences in how they're structured produce divergent outputs even for semantically identical products. A product with a rich description gets different tags than the same product with a sparse one. The model is making the right inference from the data it has; the data just isn't consistent.

What production systems actually do

The naive single-call approach breaks at scale. Every company that has published how they solved this converged on the same thing: constrain the LLM's output to a controlled vocabulary.

DoorDash states it explicitly: ANN retrieval surfaces the top 100 taxonomy concepts as candidates, then the LLM selects from those. Hallucination rate under 1%. Free-form generation doesn't appear in the pipeline.

Shopify runs over 30 million predictions daily across 10,000+ categories. Two sequential calls to a vision-language model - the first predicts category, the second predicts category-specific attributes using the first call's output as context. Training data comes from a multi-LLM annotation system where several models independently evaluate each product and disagreements go to an arbitration layer. With all of that, they land at 85% merchant acceptance. That's what good looks like at this scale.
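The multi-annotator idea can be sketched generically: several models label the same product, agreement is accepted as-is, and disagreement is routed to an arbiter. This is not Shopify's implementation, which the summary above doesn't detail; annotate, arbitrate, and min_agreement are hypothetical names, and the arbiter here is a placeholder.

```python
from collections import Counter
from typing import Callable, Sequence


def annotate(
    labels: Sequence[str],
    arbitrate: Callable[[Sequence[str]], str],
    min_agreement: float = 1.0,
) -> tuple[str, bool]:
    """Return (label, was_arbitrated) for one product's independent labels."""
    if not labels:
        raise ValueError("need at least one label")
    top_label, votes = Counter(labels).most_common(1)[0]
    if votes / len(labels) >= min_agreement:
        return top_label, False      # annotators agree strongly enough
    return arbitrate(labels), True   # disagreement goes to the arbitration layer


# Placeholder arbiter (assumption): a real one would be another model or a human.
def toy_arbiter(labels: Sequence[str]) -> str:
    return sorted(labels)[0]


print(annotate(["shoes", "shoes", "shoes"], toy_arbiter))
print(annotate(["shoes", "boots", "shoes"], toy_arbiter))
```

With the default min_agreement of 1.0, only unanimous labels skip arbitration. Lowering it trades arbitration cost for noisier training data.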

Amazon's dual-expert system takes a different angle. A fine-tuned domain model narrows the label space to a small set of candidates. Then a general LLM reasons over the differences between those candidates and picks the best match. The general LLM never classifies cold - it always operates against the candidate set from the first model. The combination outperforms the domain model alone.

The vocabulary is an input, not an output

Once you treat the tag vocabulary as a first-class input to every LLM call, the consistency problem largely goes away.

For smaller catalogs the approach is direct - inject the existing tag list into the prompt with an explicit instruction to prefer existing tags and only introduce a new one when nothing fits. Structured output handles the rest. Spring AI makes this straightforward: define the output schema, and the model returns something parseable rather than something creative. Set temperature near zero. You want selection that is as repeatable as possible, not variation.
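Here is a minimal sketch of vocabulary injection plus validation of the reply. It uses plain Python rather than Spring AI, makes no model call, and assumes a JSON reply with "tags" and "new_tags" fields; that schema and the names build_prompt and parse_reply are illustrative choices, not an established API.

```python
import json


def build_prompt(product: dict, vocabulary: list[str]) -> str:
    """Put the existing tag list into the prompt as an explicit input."""
    return (
        "Tag the product below.\n"
        "Prefer tags from EXISTING_TAGS. Propose a new tag only when none fits.\n"
        'Reply with JSON: {"tags": [...], "new_tags": [...]}\n\n'
        f"EXISTING_TAGS: {json.dumps(sorted(vocabulary))}\n"
        f"PRODUCT: {json.dumps(product, sort_keys=True)}"
    )


def parse_reply(reply: str, vocabulary: list[str]) -> tuple[list[str], list[str]]:
    """Split a reply into tags already in the vocabulary and proposals for review."""
    data = json.loads(reply)
    known = {tag.casefold(): tag for tag in vocabulary}
    accepted: list[str] = []
    proposed: list[str] = []
    for raw in data.get("tags", []) + data.get("new_tags", []):
        tag = raw.strip()
        canonical = known.get(tag.casefold())
        if canonical is not None:
            if canonical not in accepted:
                accepted.append(canonical)
        elif tag not in proposed:
            proposed.append(tag)  # never added to the vocabulary silently
    return accepted, proposed


vocab = ["science fiction", "fantasy"]
print(build_prompt({"title": "Dune"}, vocab))
print(parse_reply('{"tags": ["Science Fiction", "space opera"], "new_tags": []}', vocab))
```

The validation step matters as much as the prompt: anything outside the vocabulary lands in a review list instead of becoming a tag by accident.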

For larger vocabularies, pre-filtering the injected list via embedding similarity against the product description keeps the prompt manageable without losing coverage.
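As a sketch of that pre-filter: rank tags by cosine similarity to the product's embedding and inject only the top k. The vectors below are hand-made toy values, not output from a real embedding model, and this is an exact search over a small list, not the approximate nearest-neighbor index a large catalog would need.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_tags(product_vec: list[float], tag_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the k tags whose embeddings are closest to the product embedding."""
    ranked = sorted(tag_vecs, key=lambda tag: cosine(product_vec, tag_vecs[tag]), reverse=True)
    return ranked[:k]


# Toy embeddings (assumption): real ones would come from an embedding model.
tag_vecs = {
    "science fiction": [0.9, 0.1, 0.0],
    "fantasy": [0.6, 0.4, 0.0],
    "cookware": [0.0, 0.1, 0.9],
}
candidates = top_k_tags([0.8, 0.2, 0.0], tag_vecs, k=2)
print(candidates)
```

Only the returned candidates go into the prompt, so the prompt size stays bounded by k regardless of how large the vocabulary grows.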

The model's job is no longer to invent a taxonomy. It's to place a product within one that already exists and extend it conservatively when needed. That's a fundamentally different - and easier - task.

The cold-start problem nobody names

Here's the part that doesn't appear in most posts on this topic: every new store starts with zero tags. There's no vocabulary to inject and no taxonomy to constrain against.

This isn't a minor edge case - it's the state every new customer is in when they connect. And it's where the naive approach fully breaks down, because the fix for consistency (vocabulary injection) depends on having a vocabulary.

The approach that makes sense architecturally: seed first, bulk-classify second. Take a representative sample - enough to cover the catalog's main categories - run it through the pipeline without constraints, review and normalize the resulting vocabulary, then use that curated list as the constraint for bulk classification of the rest. You're building the taxonomy before you rely on it.
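The "review and normalize" step can be sketched as collapsing raw tags from the unconstrained seed run onto canonical ones. The alias map below is hypothetical and would come out of the human review; normalize_vocabulary is an illustrative name, not part of any library.

```python
from collections import Counter


def normalize_vocabulary(raw_tags: list[str], aliases: dict[str, str]) -> Counter:
    """Collapse reviewed aliases onto canonical tags and count how often each is used."""
    counts: Counter = Counter()
    for tag in raw_tags:
        key = tag.strip().casefold()
        counts[aliases.get(key, key)] += 1
    return counts


# Alias map produced by the human review (assumption): keys are casefolded.
aliases = {"sci-fi": "science fiction", "sf": "science fiction"}
seed_tags = ["sci-fi", "science fiction", "SF", "Fantasy"]

vocabulary = normalize_vocabulary(seed_tags, aliases)
print(vocabulary)
```

The counts are useful during review: tags used once across the whole sample are candidates for merging or dropping before the list becomes the constraint for the bulk run.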

The human effort sits at the seeding stage, not spread across the full catalog. You review a vocabulary, not individual products.

In short

Product tagging looks like a prompt engineering problem. It isn't. The consistency problem is architectural. The fix is treating your tag vocabulary as a first-class input to every LLM call - and building a pipeline that maintains and enforces that vocabulary across a catalog, not just across a single product call.

The model is rarely the bottleneck. The pipeline around it usually is.

How are you handling tag consistency in your own systems? Vocabulary injection, embeddings, a fixed taxonomy - curious what's actually working in production.