TL;DR: A customer asked my AI sales bot "what do you have?" and the bot listed product categories the store doesn't sell. My instinct was to rewrite the search router. I spent twenty minutes about to do exactly that. Then I traced where the hallucinated category list was actually coming from: not the search results, not the database, not the router. It was coming from the store's "About" text — which the system prompt was injecting as Store: ${store.description}. The model read that label as a catalog header and treated the marketing copy as inventory truth. The fix was renaming one variable string from Store: to About the store (brand voice / background — NOT a product catalog): and adding one CRITICAL rule. Zero changes to the architecture.
The Bug
I run a multi-tenant AI sales chatbot platform. One of the test stores sells men's casual clothing — shirts, pants, the basics. Its description field, the marketing blurb the merchant types on signup, reads something like:
"Modern men's wardrobe. From sharp business shirts to weekend essentials, suits, shoes, and everything in between."
Standard SEO-friendly copy. Reads fine on the storefront page.
A test customer asks the chatbot:
"Hey, what do you have?"
The bot replies:
"We've got a full men's wardrobe — business shirts, weekend essentials, suits, shoes, and everything in between. What are you in the mood for?"
Customer:
"Great, I'll take a suit."
The store has zero suits. Has never sold a suit. The product table has thirty-four rows; none of them are suits. The bot just promised something the catalog can't deliver. The customer escalates, asks for sizing, and now there's a trust break two messages into the conversation.
I have seen this kind of bug before. I had a whole architecture in place to prevent exactly this.
The Architecture I Was Sure I'd Have to Rewrite
When the customer's message hits a generic phrase like "what do you have" or "show me everything," my chat router doesn't call a freeform "describe the store" prompt. It branches into a dedicated path that pulls the actual product table, builds a category breakdown — { "Shirts": 18 items, $20-$60 }, { "Pants": 12 items, $30-$80 } — and feeds that into the response model as the source of truth.
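For concreteness, here is a minimal sketch of what that branch can look like. The names (isGenericBrowseQuery, buildCategoryOverview) and the db.query call are illustrative placeholders, not my production code:

// Detect browse-style questions and answer them from the product table,
// not from a freeform "describe the store" prompt.
const GENERIC_BROWSE = [/what do you (have|sell)/i, /show me everything/i];

function isGenericBrowseQuery(message) {
  return GENERIC_BROWSE.some((re) => re.test(message));
}

// Builds the { "Shirts": "18 items, $20-$60" } overview straight from the real table.
async function buildCategoryOverview(storeId) {
  const rows = await db.query(
    `SELECT category, COUNT(*) AS items, MIN(price) AS min, MAX(price) AS max
       FROM products WHERE store_id = $1 GROUP BY category`,
    [storeId]
  );
  return Object.fromEntries(
    rows.map((r) => [r.category, `${r.items} items, $${r.min}-$${r.max}`])
  );
}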
The architecture is deliberate. I wrote about it before: prompt engineering controls tone, architecture controls behavior. If you want the model to never invent a product, don't beg it not to; give it search results and a tool contract that says "you can only reference what came back from this call." The grounded-LLM playbook.
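If you want that contract to be more than a polite request, you can also enforce it after the model call. A rough sketch, assuming the search results carry product IDs and the model is asked to return the IDs it recommends (the recommended_product_ids field is made up for illustration):

// Reject any reply that references a product ID that never came back from search.
function enforceSearchContract(modelReply, searchResults) {
  const knownIds = new Set(searchResults.map((p) => p.id));
  const invented = (modelReply.recommended_product_ids || []).filter(
    (id) => !knownIds.has(id)
  );
  if (invented.length > 0) {
    // The model referenced something outside the search payload: retry or fall back.
    throw new Error(`Ungrounded product references: ${invented.join(", ")}`);
  }
  return modelReply;
}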
So when I saw the bot recite suits and shoes for a store that has neither, my first instinct was the obvious one. The architecture must have broken. Either:
- The generic-phrase detection isn't firing, so we're falling through to the freeform path where hallucinations are possible.
- The category breakdown is returning wrong data — maybe pulling from another store, maybe miscategorizing.
- The search results are being clobbered somewhere between the SQL and the response prompt.
I started reading the router code with the intent to rewrite it. I had a branch open and a commit message half-typed before I stopped and did one thing first: I read the actual system prompt that was being sent to the model.
Where the Suits Came From
This is the relevant slice of the response-call system prompt as it was being assembled:
const desc = store.description ? ` Store: ${store.description}` : "";
const typeText = store.store_type ? ` Type: ${store.store_type}.` : "";
const countryText = store.country ? ` Location: ${store.country}.` : "";
const systemPrompt = `
You are the sales assistant for ${store.name}.${desc}${typeText}${countryText}
Search results for "${query}":
${searchResults}
...
`;
Look at the line that builds desc. The label is the word Store: followed by whatever the merchant typed into their description field.
Now look at what the model sees, in order:
You are the sales assistant for Diwan.
Store: Modern men's wardrobe. From sharp business shirts to weekend essentials, suits, shoes, and everything in between.
Type: Clothing & Fashion.
Location: Palestine.
Search results for "what do you have":
{ category_overview: { "Shirts": 18 items, "Pants": 12 items } }
...
The architectural defense — the real category overview — is there, lower in the prompt. It's correct. It's accurate. But two lines above it, there's another block of text labeled Store: listing categories that look like inventory: "shirts," "suits," "shoes."
The model has to decide which of those two sources to trust. The architecture was correct. The labels weren't.
The word Store: is not specific. The model doesn't know it's marketing copy. It reads exactly like the kind of label that introduces an inventory list, because in training data, structured labels followed by category-shaped text usually are inventory lists. Every Shopify product CSV header. Every catalog JSON. The model is doing exactly what its training pulls it toward.
The marketing blurb wasn't being treated as marketing. It was being treated as a catalog because it had been labeled like one.
The Fix: Two Lines
There was no architectural change. The router stayed. The search results stayed. The category-overview path stayed. Two edits to the prompt construction:
Edit one — relabel the injection:
const desc = store.description
? ` About the store (brand voice / background — NOT a product catalog): ${store.description}`
: "";
The model now reads the description with an explicit epistemic frame. This text exists, but it is brand voice. It is not inventory. There is a different source for inventory below.
Edit two — add a CRITICAL rule to the response prompt:
CRITICAL: When the customer asks what you have / what you sell / your
catalog / "شو عندك" / "إيش عندكم" "What do you have — list ONLY categories that appear
in the search results. NEVER enumerate categories from the store
background or description text. The background is brand voice; the
search results are inventory truth.
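Put together, the assembled prompt now reads roughly like this. Same template as the original snippet above; only the label and the appended rule changed (a sketch, not the exact production string):

const desc = store.description
  ? ` About the store (brand voice / background — NOT a product catalog): ${store.description}`
  : "";
const typeText = store.store_type ? ` Type: ${store.store_type}.` : "";
const countryText = store.country ? ` Location: ${store.country}.` : "";

const systemPrompt = `
You are the sales assistant for ${store.name}.${desc}${typeText}${countryText}

Search results for "${query}":
${searchResults}

CRITICAL: When the customer asks what you have / what you sell / your
catalog, list ONLY categories that appear in the search results. NEVER
enumerate categories from the store background or description text. The
background is brand voice; the search results are inventory truth.
`;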
That's the entire fix. Same architecture, same database, same router branches, same tool contract. The bug closed. The bot stopped offering suits the store doesn't sell.
Architecture vs Prompt Is the Wrong Dichotomy
There's a clean-sounding mental model that goes: "if the bug is the model behaving badly, change the architecture; if the bug is the model sounding wrong, change the prompt." I've written and quoted versions of that myself.
It's not wrong, exactly. It's just not the right axis when you're sitting in front of an actual bug, three minutes from typing git checkout -b rewrite-search-router.
A better question to ask first:
Where, in the bytes I send the model, does the wrong information live?
Not "is my architecture sound." Not "is my prompt strict enough." Where, literally, on the screen, are the suits coming from?
In my case, the suits were in the prompt — in a string I'd inserted myself, with a label that the model was perfectly entitled to interpret as a catalog. The architecture was clean. The search was clean. The defense was clean. I just hadn't been careful about what frame I gave the model for each block of context I passed in.
The general pattern, which I now check on every grounded-LLM bug:
- Trace the output back to a span of bytes in the prompt. Not metaphorically — literally find the substring the model echoed. Is it from searchResults? From store.description? From an example in a few-shot block? From an old conversation summary you forgot was being passed? (A throwaway helper for this step is sketched after the list.)
- Look at the label that introduces that span. Store: is not a label, it's a noise word. About the store (brand voice / background — NOT a product catalog): is a label. Specificity here is grounding.
- Check whether another span in the same prompt contains the correct answer. If yes, the bug is precedence, not absence. The model has both truths in front of it and picked the wrong one because the wrong one had higher epistemic weight from its labeling.
- Only then ask if the architecture needs changing. Usually it doesn't.
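The first step is mechanical enough to script. A throwaway helper along these lines (purely illustrative) tells you which labeled span of the prompt a hallucinated term actually lives in:

// Given the prompt broken into labeled spans and a term the model echoed,
// report every span that could have supplied it.
function findLeakSources(spans, hallucinatedTerm) {
  const needle = hallucinatedTerm.toLowerCase();
  return Object.entries(spans)
    .filter(([, text]) => String(text).toLowerCase().includes(needle))
    .map(([label]) => label);
}

// In this bug: "suits" shows up in the description span, not the search results.
findLeakSources(
  {
    "store.description": store.description,
    searchResults: JSON.stringify(searchResults),
  },
  "suits"
);
// => ["store.description"]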
The first time I ran this checklist, the "two-line fix" only existed because I'd already written the architectural defense months earlier. The category-overview path was the truth I needed the model to use. The prompt was just calling something else "Store:" right above it and letting the model decide.
The Inversion
I've published before that prompt engineering controls tone and architecture controls behavior. That's still true. But there's a second half I want to write down, because I keep relearning it:
Architecture builds the truth. The prompt decides whether the model believes it.
You can have a flawless retrieval pipeline, a tool contract, a typed search result, a JSON-mode response constraint — and the model will still output a hallucination if the prompt above the truth says, in any voice, "here's the inventory" while pointing at the wrong block.
The two layers aren't in opposition. They're stacked. Architecture is what you make available to the model. The prompt is how you label what you made available. If the labels are vague, the model fills in the meaning from its training, which usually means it picks the most common interpretation — and the most common interpretation of Store: followed by category-shaped prose is "this is the store's inventory."
When the bug looks architectural, check the prompt. When the bug looks like a prompt problem, check what context is reaching the model. The bug almost always lives at the seam between the two, not inside one of them.
The Takeaway
You don't have to choose between "fix the architecture" and "fix the prompt." That dichotomy will burn afternoons.
Ask one question before you reach for either tool: where, in the bytes I'm sending, does the wrong answer come from?
For me, it was the marketing description. Wearing a catalog label. Sitting two lines above the real catalog. The model wasn't wrong to read it that way. I was wrong to label it that way.
The fix was a string relabel and one rule. Twenty minutes of diagnosis, two small edits to a prompt template. The architecture I almost rewrote was already correct.